Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead
On Sat, Dec 08, 2018 at 08:50:32AM +0800, Qu Wenruo wrote:
> > I've adapted a stress test that unpacks a large tarball, snapshots
> > every 20 seconds, deletes a random snapshot every 50 seconds, deletes
> > files from the original subvolume, now enhanced with qgroups just for the
> > new snapshots inheriting the toplevel subvolume. Lockup.
> >
> > It gets stuck in a snapshot call with the following stacktrace:
> >
> > [<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
> > [<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]
>
> This looks like the original subtree tracing has something wrong.

Yes, I ran the test on current master and it locked up too, so it's not due to your patchset.

> Thanks for the report, I'll investigate it.

Thanks.
Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead
On 2018/12/8 8:47 AM, David Sterba wrote:
> On Fri, Dec 07, 2018 at 06:51:21AM +0800, Qu Wenruo wrote:
>>
>> On 2018/12/7 3:35 AM, David Sterba wrote:
>>> On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
>>>> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
>>>>> This patchset can be fetched from github:
>>>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
>>>>>
>>>>> Which is based on v4.20-rc1.
>>>>
>>>> Thanks, I'll add it to for-next soon.
>>>
>>> The branch was there for some time but not for at least a week (my
>>> mistake, I did not notice in time). I've rebased it on top of recent
>>> misc-next, but without the delayed refs patchset from Josef.
>>>
>>> At the moment I'm considering it for merge to 4.21; there's still some
>>> time to pull it out in case it shows up to be too problematic. I'm
>>> mostly worried about the unknown interactions with the enospc updates or
>>
>> For that part, I don't think it would have any obvious problem with the
>> enospc updates.
>>
>> The only user-noticeable effect is the delay of reloc tree deletion.
>>
>> Despite that, it's mostly transparent to extent allocation.
>>
>>> generally because of lack of qgroup and reloc code reviews.
>>
>> That's the biggest problem.
>>
>> However, most of the current qgroup + balance optimization is done inside
>> qgroup code (to skip certain qgroup records); if we're going to hit some
>> problem, then this patchset would have the highest possibility of hitting it.
>>
>> Later patches will mostly just keep tweaking qgroup without affecting any
>> other parts.
>>
>> So I'm fine if you decide to pull it out for now.
>
> I've adapted a stress test that unpacks a large tarball, snapshots
> every 20 seconds, deletes a random snapshot every 50 seconds, deletes
> files from the original subvolume, now enhanced with qgroups just for the
> new snapshots inheriting the toplevel subvolume. Lockup.
>
> It gets stuck in a snapshot call with the following stacktrace:
>
> [<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
> [<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]

This looks like the original subtree tracing has something wrong.

Thanks for the report, I'll investigate it.

Qu

> [<0>] do_walk_down+0x681/0xb20 [btrfs]
> [<0>] walk_down_tree+0xf5/0x1c0 [btrfs]
> [<0>] btrfs_drop_snapshot+0x43b/0xb60 [btrfs]
> [<0>] btrfs_clean_one_deleted_snapshot+0xc1/0x120 [btrfs]
> [<0>] cleaner_kthread+0xf8/0x170 [btrfs]
> [<0>] kthread+0x121/0x140
> [<0>] ret_from_fork+0x27/0x50
>
> and that's like the 10th snapshot and ~3rd deletion. This is qgroup show:
>
> qgroupid        rfer        excl  parent
> --------        ----        ----  ------
> 0/5        865.27MiB     1.66MiB  ---
> 0/257          0.00B       0.00B  ---
> 0/259          0.00B       0.00B  ---
> 0/260      806.58MiB   637.25MiB  ---
> 0/262          0.00B       0.00B  ---
> 0/263          0.00B       0.00B  ---
> 0/264          0.00B       0.00B  ---
> 0/265          0.00B       0.00B  ---
> 0/266          0.00B       0.00B  ---
> 0/267          0.00B       0.00B  ---
> 0/268          0.00B       0.00B  ---
> 0/269          0.00B       0.00B  ---
> 0/270      989.04MiB     1.22MiB  ---
> 0/271          0.00B       0.00B  ---
> 0/272      922.25MiB   416.00KiB  ---
> 0/273      931.02MiB     1.50MiB  ---
> 0/274      910.94MiB     1.52MiB  ---
> 1/1          1.64GiB     1.64GiB  0/5,0/257,0/259,0/260,0/262,0/263,0/264,0/265,0/266,0/267,0/268,0/269,0/270,0/271,0/272,0/273,0/274
>
> No IO or cpu activity at this point; the stacktrace and show output
> remain the same.
>
> So, considering this, I'm not going to add the patchset to 4.21 but will
> keep it in for-next for testing; any fixups or updates will be applied.
Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead
On Fri, Dec 07, 2018 at 06:51:21AM +0800, Qu Wenruo wrote:
>
> On 2018/12/7 3:35 AM, David Sterba wrote:
> > On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
> >> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
> >>> This patchset can be fetched from github:
> >>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
> >>>
> >>> Which is based on v4.20-rc1.
> >>
> >> Thanks, I'll add it to for-next soon.
> >
> > The branch was there for some time but not for at least a week (my
> > mistake, I did not notice in time). I've rebased it on top of recent
> > misc-next, but without the delayed refs patchset from Josef.
> >
> > At the moment I'm considering it for merge to 4.21; there's still some
> > time to pull it out in case it shows up to be too problematic. I'm
> > mostly worried about the unknown interactions with the enospc updates or
>
> For that part, I don't think it would have any obvious problem with the
> enospc updates.
>
> The only user-noticeable effect is the delay of reloc tree deletion.
>
> Despite that, it's mostly transparent to extent allocation.
>
> > generally because of lack of qgroup and reloc code reviews.
>
> That's the biggest problem.
>
> However, most of the current qgroup + balance optimization is done inside
> qgroup code (to skip certain qgroup records); if we're going to hit some
> problem, then this patchset would have the highest possibility of hitting it.
>
> Later patches will mostly just keep tweaking qgroup without affecting any
> other parts.
>
> So I'm fine if you decide to pull it out for now.

I've adapted a stress test that unpacks a large tarball, snapshots every 20 seconds, deletes a random snapshot every 50 seconds, deletes files from the original subvolume, now enhanced with qgroups just for the new snapshots inheriting the toplevel subvolume. Lockup.
It gets stuck in a snapshot call with the following stacktrace:

[<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
[<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]
[<0>] do_walk_down+0x681/0xb20 [btrfs]
[<0>] walk_down_tree+0xf5/0x1c0 [btrfs]
[<0>] btrfs_drop_snapshot+0x43b/0xb60 [btrfs]
[<0>] btrfs_clean_one_deleted_snapshot+0xc1/0x120 [btrfs]
[<0>] cleaner_kthread+0xf8/0x170 [btrfs]
[<0>] kthread+0x121/0x140
[<0>] ret_from_fork+0x27/0x50

and that's like the 10th snapshot and ~3rd deletion. This is qgroup show:

qgroupid        rfer        excl  parent
--------        ----        ----  ------
0/5        865.27MiB     1.66MiB  ---
0/257          0.00B       0.00B  ---
0/259          0.00B       0.00B  ---
0/260      806.58MiB   637.25MiB  ---
0/262          0.00B       0.00B  ---
0/263          0.00B       0.00B  ---
0/264          0.00B       0.00B  ---
0/265          0.00B       0.00B  ---
0/266          0.00B       0.00B  ---
0/267          0.00B       0.00B  ---
0/268          0.00B       0.00B  ---
0/269          0.00B       0.00B  ---
0/270      989.04MiB     1.22MiB  ---
0/271          0.00B       0.00B  ---
0/272      922.25MiB   416.00KiB  ---
0/273      931.02MiB     1.50MiB  ---
0/274      910.94MiB     1.52MiB  ---
1/1          1.64GiB     1.64GiB  0/5,0/257,0/259,0/260,0/262,0/263,0/264,0/265,0/266,0/267,0/268,0/269,0/270,0/271,0/272,0/273,0/274

No IO or cpu activity at this point; the stacktrace and show output remain the same.

So, considering this, I'm not going to add the patchset to 4.21 but will keep it in for-next for testing; any fixups or updates will be applied.
Re: [PATCH] libbtrfsutil: fix unprivileged tests if kernel lacks support
On Thu, Dec 06, 2018 at 04:29:32PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval
>
> I apparently didn't test this on a pre-4.18 kernel.
> test_subvolume_info_unprivileged() checks for an ENOTTY, but this
> doesn't seem to work correctly with subTest().
> test_subvolume_iterator_unprivileged() doesn't have a check at all. Add
> an explicit check to both before doing the actual test.
>
> Signed-off-by: Omar Sandoval

Applied, thanks.
Re: [PATCH 2/8] btrfs: extent-tree: Open-code process_func in __btrfs_mod_ref
On 6.12.2018 at 8:58, Qu Wenruo wrote:
> The process_func is never a function hook used anywhere else.
>
> Open code it to make the later delayed ref refactor easier, so we can
> refactor btrfs_inc_extent_ref() and btrfs_free_extent() in different
> patches.
>
> Signed-off-by: Qu Wenruo

Reviewed-by: Nikolay Borisov

> ---
>  fs/btrfs/extent-tree.c | 33 ++++++++++++++++++++---------------
>  1 file changed, 18 insertions(+), 15 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index ea2c3d5220f0..ea68d288d761 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3220,10 +3220,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
>  	int i;
>  	int level;
>  	int ret = 0;
> -	int (*process_func)(struct btrfs_trans_handle *,
> -			    struct btrfs_root *,
> -			    u64, u64, u64, u64, u64, u64, bool);
>
>  	if (btrfs_is_testing(fs_info))
>  		return 0;
> @@ -3235,11 +3231,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
>  	if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state) && level == 0)
>  		return 0;
>
> -	if (inc)
> -		process_func = btrfs_inc_extent_ref;
> -	else
> -		process_func = btrfs_free_extent;
> -
>  	if (full_backref)
>  		parent = buf->start;
>  	else
> @@ -3261,17 +3252,29 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
>
>  			num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi);
>  			key.offset -= btrfs_file_extent_offset(buf, fi);
> -			ret = process_func(trans, root, bytenr, num_bytes,
> -					   parent, ref_root, key.objectid,
> -					   key.offset, for_reloc);
> +			if (inc)
> +				ret = btrfs_inc_extent_ref(trans, root, bytenr,
> +						num_bytes, parent, ref_root,
> +						key.objectid, key.offset,
> +						for_reloc);
> +			else
> +				ret = btrfs_free_extent(trans, root, bytenr,
> +						num_bytes, parent, ref_root,
> +						key.objectid, key.offset,
> +						for_reloc);
>  			if (ret)
>  				goto fail;
>  		} else {
>  			bytenr = btrfs_node_blockptr(buf, i);
>  			num_bytes = fs_info->nodesize;
> -			ret = process_func(trans, root, bytenr, num_bytes,
> -					   parent, ref_root, level - 1, 0,
> -					   for_reloc);
> +			if (inc)
> +				ret = btrfs_inc_extent_ref(trans, root, bytenr,
> +						num_bytes, parent, ref_root,
> +						level - 1, 0, for_reloc);
> +			else
> +				ret = btrfs_free_extent(trans, root, bytenr,
> +						num_bytes, parent, ref_root,
> +						level - 1, 0, for_reloc);
>  			if (ret)
>  				goto fail;
>  		}
>
Re: System unable to mount partition after a power loss
I ran that command, and I cannot get the email to send properly to the mailing list, as the attachment of the output is over 4.6M.

On 12/7/2018 11:49 AM, Doni Crosby wrote:
> The output of the command is attached. This is what errors showed up on the system:
>
> parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
> parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
> parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
> parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
> parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
> parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> Ignoring transid failure
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> Ignoring transid failure
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> Ignoring transid failure
> parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
> parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
> parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
> parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
> Ignoring transid failure
> parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
> parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
> parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
> parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
> Ignoring transid failure
> WARNING: eb corrupted: parent bytenr 31801344 slot 132 level 1 child bytenr 3563231461376 level has 1 expect 0, skipping the slot
> parent transid verify failed on 3563231494144 wanted 5184691 found 5183325
> parent transid verify failed on 3563231494144 wanted 5184691 found 5183325
> parent transid verify failed on 3563231494144 wanted 5184691 found 5183325
> parent transid verify failed on 3563231494144 wanted 5184691 found 5183325
> Ignoring transid failure
> parent transid verify failed on 3563231526912 wanted 5184691 found 5183325
> parent transid verify failed on 3563231526912 wanted 5184691 found 5183325
> parent transid verify failed on 3563231526912 wanted 5184691 found 5183325
> parent transid verify failed on 3563231526912 wanted 5184691 found 5183325
> Ignoring transid failure
> parent transid verify failed on 3563229626368 wanted 5184691 found 5184689
> parent transid verify failed on 3563229626368 wanted 5184691 found 5184689
> parent transid verify failed on 3563229937664 wanted 5184691 found 5184689
> parent transid verify failed on 3563229937664 wanted 5184691 found 5184689
> parent transid verify failed on 3563226857472 wanted 5184691 found 5184689
> parent transid verify failed on 3563226857472 wanted 5184691 found 5184689
> parent transid verify failed on 3563230674944 wanted 5184691 found 5183325
> parent transid verify failed on 3563230674944 wanted 5184691 found 5183325
> parent transid verify failed on 3563230674944 wanted 5184691 found 5183325
> parent transid verify failed on 3563230674944 wanted 5184691 found 5183325
> Ignoring transid failure

On Fri, Dec 7, 2018 at 2:22 AM Qu Wenruo wrote:
> On 2018/12/7 1:24 PM, Doni Crosby wrote:
> > All,
> >
> > I'm coming to you to see if there is a way to fix, or at least recover, most of the data I have from a btrfs filesystem. The system went down after both a breaker and the battery backup failed. I cannot currently mount the system, with the following error from dmesg:
> >
> > Note: The vda1 is just the entire disk being passed from the VM host to the VM; it's not an actual true virtual block device.
> >
> > [  499.704398] BTRFS info (device vda1): disk space caching is enabled
> > [  499.704401] BTRFS info (device vda1): has skinny extents
> > [  499.739522] BTRFS error (device vda1): parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
>
> Transid mismatch normally means the fs is screwed up more or less.
> And according to your mount failure, it looks like the fs got screwed up badly.
>
> What's the kernel version used in the VM? I don't really think the VM is always using the latest kernel.
>
> > [  499.740257] BTRFS error (device vda1): parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
Re: System unable to mount partition after a power loss
I just looked at the VM; it does not have a cache. That's the default in proxmox to improve performance.

On Fri, Dec 7, 2018 at 7:25 AM Austin S. Hemmelgarn wrote:
>
> On 2018-12-07 01:43, Doni Crosby wrote:
> >> This is qemu-kvm? What's the cache mode being used? It's possible the
> >> usual write guarantees are thwarted by VM caching.
> > Yes it is a proxmox host running the system so it is a qemu vm, I'm
> > unsure on the caching situation.
> On the note of QEMU and the cache mode, the only cache mode I've seen to
> actually cause issues for BTRFS volumes _inside_ a VM is 'cache=unsafe',
> but that causes problems for most filesystems, so it's probably not the
> issue here.
>
> OTOH, I've seen issues with most of the cache modes other than
> 'cache=writeback' and 'cache=writethrough' when dealing with BTRFS as
> the back-end storage on the host system, and most of the time such
> issues will manifest as both problems with the volume inside the VM
> _and_ the volume the disk images are being stored on.
[PATCH v2] Btrfs: use generic_remap_file_range_prep() for cloning and deduplication
From: Filipe Manana

Since cloning and deduplication are no longer Btrfs specific operations, we now have generic code to handle parameter validation, compare file ranges used for deduplication, clear capabilities when cloning, etc.

This change makes Btrfs use it, eliminating a lot of code in Btrfs and also fixing a few bugs, such as:

1) When cloning, the destination file's capabilities were not dropped (the fstest generic/513 tests this);

2) We were not checking if the destination file is immutable;

3) Not checking if either the source or destination files are swap files (swap file support is coming soon for Btrfs);

4) System limits were not checked (resource limits and O_LARGEFILE).

Note that the generic helper generic_remap_file_range_prep() does start and wait for writeback by calling filemap_write_and_wait_range(); however, that is not enough for Btrfs for two reasons:

1) With compression, we need to start writeback twice in order to get the pages marked for writeback and ordered extents created;

2) filemap_write_and_wait_range() (and all its other variants) only waits for the IO to complete, but we need to wait for the ordered extents to finish, so that when we do the actual reflinking operations the file extent items are in the fs tree. This is also important due to the fact that the generic helper, for the deduplication case, compares the contents of the pages in the requested range, which might require reading extents from disk in the very unlikely case that pages get invalidated after writeback finishes (so the file extent items must be up to date in the fs tree).

Since these reasons are specific to Btrfs, we have to do it in the Btrfs code before calling generic_remap_file_range_prep().
This also results in a simpler way of dealing with existing delalloc in the source/target ranges, especially for the deduplication case, where we used to lock all the pages first and then, if we found any delalloc or ordered extent in the range, we would unlock the pages, trigger writeback, wait for ordered extents to complete, then lock all the pages again and check if deduplication can be done. So now we get a simpler approach: lock the inodes, then trigger writeback, and then wait for ordered extents to complete.

So make Btrfs use generic_remap_file_range_prep() (XFS and OCFS2 use it) to eliminate duplicated code, fix a few bugs and benefit from future bug fixes done there - for example, the recent clone and dedupe bugs involving reflinking a partial EOF block got a counterpart fix in the generic helper, since they affected all filesystems supporting these operations, so we no longer need special checks in Btrfs for them.

Signed-off-by: Filipe Manana
---
V2: Removed check that verifies if either of the inodes is a directory, as it is done by generic_remap_file_range_prep(). Oddly, in Btrfs it was being done only for cloning but not for dedupe.
 fs/btrfs/ioctl.c | 612 ---
 1 file changed, 129 insertions(+), 483 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 802a628e9f7d..321fb9bc149d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3191,92 +3191,6 @@ static long btrfs_ioctl_dev_info(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
-static struct page *extent_same_get_page(struct inode *inode, pgoff_t index)
-{
-	struct page *page;
-
-	page = grab_cache_page(inode->i_mapping, index);
-	if (!page)
-		return ERR_PTR(-ENOMEM);
-
-	if (!PageUptodate(page)) {
-		int ret;
-
-		ret = btrfs_readpage(NULL, page);
-		if (ret)
-			return ERR_PTR(ret);
-		lock_page(page);
-		if (!PageUptodate(page)) {
-			unlock_page(page);
-			put_page(page);
-			return ERR_PTR(-EIO);
-		}
-		if (page->mapping != inode->i_mapping) {
-			unlock_page(page);
-			put_page(page);
-			return ERR_PTR(-EAGAIN);
-		}
-	}
-
-	return page;
-}
-
-static int gather_extent_pages(struct inode *inode, struct page **pages,
-			       int num_pages, u64 off)
-{
-	int i;
-	pgoff_t index = off >> PAGE_SHIFT;
-
-	for (i = 0; i < num_pages; i++) {
-again:
-		pages[i] = extent_same_get_page(inode, index + i);
-		if (IS_ERR(pages[i])) {
-			int err = PTR_ERR(pages[i]);
-
-			if (err == -EAGAIN)
-				goto again;
-			pages[i] = NULL;
-			return err;
-		}
-	}
-	return 0;
-}
-
-static int lock_extent_range(struct inode *inode, u64 off, u64 len,
-			     bool retry_range_locking)
-{
-	/*
-	 * Do any pending
Re: [PATCH 05/10] btrfs: introduce delayed_refs_rsv
On 3.12.2018 at 17:20, Josef Bacik wrote:
> From: Josef Bacik
>
> Traditionally we've had voodoo in btrfs to account for the space that
> delayed refs may take up by having a global_block_rsv. This works most
> of the time, except when it doesn't. We've had issues reported and seen
> in production where sometimes the global reserve is exhausted during
> transaction commit before we can run all of our delayed refs, resulting
> in an aborted transaction. Because of this voodoo we have equally
> dubious flushing semantics around throttling delayed refs which we often
> get wrong.
>
> So instead give them their own block_rsv. This way we can always know
> exactly how much outstanding space we need for delayed refs. This
> allows us to make sure we are constantly filling that reservation up
> with space, and allows us to put more precise pressure on the enospc
> system. Instead of doing math to see if it's a good time to throttle,
> the normal enospc code will be invoked if we have a lot of delayed refs
> pending, and they will be run via the normal flushing mechanism.
>
> For now the delayed_refs_rsv will hold the reservations for the delayed
> refs, the block group updates, and deleting csums. We could have a
> separate rsv for the block group updates, but the csum deletion stuff is
> still handled via the delayed_refs so that will stay there.

I see one difference in the way that the space is managed. Essentially, for the delayed refs rsv you'll only ever be increasing ->size, and ->reserved only when you have to refill. This is opposite to the way other metadata space is managed, i.e. by using use_block_rsv, which subtracts ->reserved every time a block has to be CoW'ed. Why this difference?
>
> Signed-off-by: Josef Bacik
> ---
>  fs/btrfs/ctree.h       |  14 +++-
>  fs/btrfs/delayed-ref.c |  43 --
>  fs/btrfs/disk-io.c     |   4 +
>  fs/btrfs/extent-tree.c | 212 +
>  fs/btrfs/transaction.c |  37 -
>  5 files changed, 284 insertions(+), 26 deletions(-)
>
> +/**
> + * btrfs_migrate_to_delayed_refs_rsv - transfer bytes to our delayed refs rsv.
> + * @fs_info - the fs info for our fs.
> + * @src - the source block rsv to transfer from.
> + * @num_bytes - the number of bytes to transfer.
> + *
> + * This transfers up to the num_bytes amount from the src rsv to the
> + * delayed_refs_rsv. Any extra bytes are returned to the space info.
> + */
> +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> +				       struct btrfs_block_rsv *src,
> +				       u64 num_bytes)

This function is currently used only during transaction start; it seems to be rather specific to the delayed refs, so I'd suggest making it private to transaction.c.

> +{
> +	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
> +	u64 to_free = 0;
> +
> +	spin_lock(&src->lock);
> +	src->reserved -= num_bytes;
> +	src->size -= num_bytes;
> +	spin_unlock(&src->lock);
> +
> +	spin_lock(&delayed_refs_rsv->lock);
> +	if (delayed_refs_rsv->size > delayed_refs_rsv->reserved) {
> +		u64 delta = delayed_refs_rsv->size -
> +			delayed_refs_rsv->reserved;
> +		if (num_bytes > delta) {
> +			to_free = num_bytes - delta;
> +			num_bytes = delta;
> +		}
> +	} else {
> +		to_free = num_bytes;
> +		num_bytes = 0;
> +	}
> +
> +	if (num_bytes)
> +		delayed_refs_rsv->reserved += num_bytes;
> +	if (delayed_refs_rsv->reserved >= delayed_refs_rsv->size)
> +		delayed_refs_rsv->full = 1;
> +	spin_unlock(&delayed_refs_rsv->lock);
> +
> +	if (num_bytes)
> +		trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
> +					      0, num_bytes, 1);
> +	if (to_free)
> +		space_info_add_old_bytes(fs_info, delayed_refs_rsv->space_info,
> +					 to_free);
> +}
> +
> +/**
> + * btrfs_delayed_refs_rsv_refill - refill based on our delayed refs usage.
> + * @fs_info - the fs_info for our fs.
> + * @flush - control how we can flush for this reservation.
> + *
> + * This will refill the delayed block_rsv up to 1 items size worth of space and
> + * will return -ENOSPC if we can't make the reservation.
> + */
> +int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
> +				  enum btrfs_reserve_flush_enum flush)
> +{
> +	struct btrfs_block_rsv *block_rsv = &fs_info->delayed_refs_rsv;
> +	u64 limit = btrfs_calc_trans_metadata_size(fs_info, 1);
> +	u64 num_bytes = 0;
> +	int ret = -ENOSPC;
> +
> +	spin_lock(&block_rsv->lock);
> +	if (block_rsv->reserved < block_rsv->size) {
> +		num_bytes = block_rsv->size - block_rsv->reserved;
> +		num_bytes = min(num_bytes, limit);
> +	}
> +
[PATCH] Btrfs: use generic_remap_file_range_prep() for cloning and deduplication
From: Filipe Manana

Since cloning and deduplication are no longer Btrfs specific operations, we now have generic code to handle parameter validation, compare file ranges used for deduplication, clear capabilities when cloning, etc.

This change makes Btrfs use it, eliminating a lot of code in Btrfs and also fixing a few bugs, such as:

1) When cloning, the destination file's capabilities were not dropped (the fstest generic/513 tests this);

2) We were not checking if the destination file is immutable;

3) Not checking if either the source or destination files are swap files (swap file support is coming soon for Btrfs);

4) System limits were not checked (resource limits and O_LARGEFILE).

Note that the generic helper generic_remap_file_range_prep() does start and wait for writeback by calling filemap_write_and_wait_range(); however, that is not enough for Btrfs for two reasons:

1) With compression, we need to start writeback twice in order to get the pages marked for writeback and ordered extents created;

2) filemap_write_and_wait_range() (and all its other variants) only waits for the IO to complete, but we need to wait for the ordered extents to finish, so that when we do the actual reflinking operations the file extent items are in the fs tree. This is also important due to the fact that the generic helper, for the deduplication case, compares the contents of the pages in the requested range, which might require reading extents from disk in the very unlikely case that pages get invalidated after writeback finishes (so the file extent items must be up to date in the fs tree).

Since these reasons are specific to Btrfs, we have to do it in the Btrfs code before calling generic_remap_file_range_prep().
This also results in a simpler way of dealing with existing delalloc in the source/target ranges, especially for the deduplication case, where we used to lock all the pages first and then, if we found any delalloc or ordered extent in the range, we would unlock the pages, trigger writeback, wait for ordered extents to complete, then lock all the pages again and check if deduplication can be done. So now we get a simpler approach: lock the inodes, then trigger writeback, and then wait for ordered extents to complete.

So make Btrfs use generic_remap_file_range_prep() (XFS and OCFS2 use it) to eliminate duplicated code, fix a few bugs and benefit from future bug fixes done there - for example, the recent clone and dedupe bugs involving reflinking a partial EOF block got a counterpart fix in the generic helper, since they affected all filesystems supporting these operations, so we no longer need special checks in Btrfs for them.

Signed-off-by: Filipe Manana
---
 fs/btrfs/ioctl.c | 615 ---
 1 file changed, 132 insertions(+), 483 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 802a628e9f7d..261e116dddb2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3191,92 +3191,6 @@ static long btrfs_ioctl_dev_info(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
-static struct page *extent_same_get_page(struct inode *inode, pgoff_t index)
-{
-	struct page *page;
-
-	page = grab_cache_page(inode->i_mapping, index);
-	if (!page)
-		return ERR_PTR(-ENOMEM);
-
-	if (!PageUptodate(page)) {
-		int ret;
-
-		ret = btrfs_readpage(NULL, page);
-		if (ret)
-			return ERR_PTR(ret);
-		lock_page(page);
-		if (!PageUptodate(page)) {
-			unlock_page(page);
-			put_page(page);
-			return ERR_PTR(-EIO);
-		}
-		if (page->mapping != inode->i_mapping) {
-			unlock_page(page);
-			put_page(page);
-			return ERR_PTR(-EAGAIN);
-		}
-	}
-
-	return page;
-}
-
-static int gather_extent_pages(struct inode *inode, struct page **pages,
-			       int num_pages, u64 off)
-{
-	int i;
-	pgoff_t index = off >> PAGE_SHIFT;
-
-	for (i = 0; i < num_pages; i++) {
-again:
-		pages[i] = extent_same_get_page(inode, index + i);
-		if (IS_ERR(pages[i])) {
-			int err = PTR_ERR(pages[i]);
-
-			if (err == -EAGAIN)
-				goto again;
-			pages[i] = NULL;
-			return err;
-		}
-	}
-	return 0;
-}
-
-static int lock_extent_range(struct inode *inode, u64 off, u64 len,
-			     bool retry_range_locking)
-{
-	/*
-	 * Do any pending delalloc/csum calculations on inode, one way or
-	 * another, and lock file content.
-	 * The locking order is:
-	 *
-	 * 1) pages
-	 * 2) range in the inode's io tree
-
[PATCH] Btrfs: scrub, move setup of nofs contexts higher in the stack
From: Filipe Manana

Since scrub workers only do memory allocation with GFP_KERNEL when they
need to perform repair, we can move the recent setup of the nofs context
up to scrub_handle_errored_block() instead of setting it up down the
call chain at insert_full_stripe_lock() and scrub_add_page_to_wr_bio(),
removing some duplicate code and comment. So the only paths for which a
scrub worker can do memory allocations using GFP_KERNEL are the
following:

 scrub_bio_end_io_worker()
   scrub_block_complete()
     scrub_handle_errored_block()
       lock_full_stripe()
         insert_full_stripe_lock()
           -> kmalloc with GFP_KERNEL

 scrub_bio_end_io_worker()
   scrub_block_complete()
     scrub_handle_errored_block()
       scrub_write_page_to_dev_replace()
         scrub_add_page_to_wr_bio()
           -> kzalloc with GFP_KERNEL

Signed-off-by: Filipe Manana
---
Applies on top of:
 Btrfs: fix deadlock with memory reclaim during scrub

 fs/btrfs/scrub.c | 34 ++
 1 file changed, 14 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index bbd1b36f4918..f996f4064596 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -322,7 +322,6 @@ static struct full_stripe_lock *insert_full_stripe_lock(
 	struct rb_node *parent = NULL;
 	struct full_stripe_lock *entry;
 	struct full_stripe_lock *ret;
-	unsigned int nofs_flag;
 
 	lockdep_assert_held(&locks_root->lock);
 
@@ -342,15 +341,8 @@ static struct full_stripe_lock *insert_full_stripe_lock(
 
 	/*
 	 * Insert new lock.
-	 *
-	 * We must use GFP_NOFS because the scrub task might be waiting for a
-	 * worker task executing this function and in turn a transaction commit
-	 * might be waiting the scrub task to pause (which needs to wait for all
-	 * the worker tasks to complete before pausing).
 	 */
-	nofs_flag = memalloc_nofs_save();
 	ret = kmalloc(sizeof(*ret), GFP_KERNEL);
-	memalloc_nofs_restore(nofs_flag);
 	if (!ret)
 		return ERR_PTR(-ENOMEM);
 	ret->logical = fstripe_logical;
@@ -842,6 +834,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	int page_num;
 	int success;
 	bool full_stripe_locked;
+	unsigned int nofs_flag;
 	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
 				      DEFAULT_RATELIMIT_BURST);
 
@@ -867,6 +860,16 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	dev = sblock_to_check->pagev[0]->dev;
 
 	/*
+	 * We must use GFP_NOFS because the scrub task might be waiting for a
+	 * worker task executing this function and in turn a transaction commit
+	 * might be waiting the scrub task to pause (which needs to wait for all
+	 * the worker tasks to complete before pausing).
+	 * We do allocations in the workers through insert_full_stripe_lock()
+	 * and scrub_add_page_to_wr_bio(), which happens down the call chain of
+	 * this function.
+	 */
+	nofs_flag = memalloc_nofs_save();
+	/*
 	 * For RAID5/6, race can happen for a different device scrub thread.
 	 * For data corruption, Parity and Data threads will both try
 	 * to recovery the data.
@@ -875,6 +878,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	 */
 	ret = lock_full_stripe(fs_info, logical, &full_stripe_locked);
 	if (ret < 0) {
+		memalloc_nofs_restore(nofs_flag);
 		spin_lock(&sctx->stat_lock);
 		if (ret == -ENOMEM)
 			sctx->stat.malloc_errors++;
@@ -914,7 +918,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	 */
 	sblocks_for_recheck = kcalloc(BTRFS_MAX_MIRRORS,
-				      sizeof(*sblocks_for_recheck), GFP_NOFS);
+				      sizeof(*sblocks_for_recheck), GFP_KERNEL);
 	if (!sblocks_for_recheck) {
 		spin_lock(&sctx->stat_lock);
 		sctx->stat.malloc_errors++;
@@ -1212,6 +1216,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	}
 
 	ret = unlock_full_stripe(fs_info, logical, full_stripe_locked);
+	memalloc_nofs_restore(nofs_flag);
 	if (ret < 0)
 		return ret;
 	return 0;
@@ -1630,19 +1635,8 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
 	mutex_lock(&sctx->wr_lock);
again:
	if (!sctx->wr_curr_bio) {
-		unsigned int nofs_flag;
-
-		/*
-		 * We must use GFP_NOFS because the scrub task might be waiting
-		 * for a worker task executing this function and in turn a
-		 * transaction commit might be waiting the scrub task to pause
-		 * (which needs to wait for all the worker tasks
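The point of the patch is that memalloc_nofs_save()/memalloc_nofs_restore() scopes an allocation constraint to a region of the call stack, so the setup can be hoisted from the two leaf functions into scrub_handle_errored_block() once. A toy Python model of that scoping idea (not kernel code; all names here are invented for illustration):

```python
import threading

# Thread-local stand-in for the task flag set by memalloc_nofs_save().
_state = threading.local()

def nofs_save():
    """Enter NOFS scope; return the previous value so nesting restores correctly."""
    old = getattr(_state, "nofs", False)
    _state.nofs = True
    return old

def nofs_restore(old):
    _state.nofs = old

def leaf_alloc():
    # Leaf allocation sites no longer manage the scope themselves;
    # they just observe whatever constraint the caller established.
    return "GFP_NOFS" if getattr(_state, "nofs", False) else "GFP_KERNEL"

def handle_errored_block():
    # Scope established once, high in the call chain, covering both
    # allocation paths below it (as the patch does for the two leaves).
    old = nofs_save()
    try:
        return [leaf_alloc(), leaf_alloc()]
    finally:
        nofs_restore(old)
```

With the scope active, both leaf allocations behave as NOFS; outside it, a bare `leaf_alloc()` is unconstrained again, which mirrors why the restore on every exit path of scrub_handle_errored_block() matters.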
Re: [PATCH 04/10] btrfs: only track ref_heads in delayed_ref_updates
On 3.12.18 17:20, Josef Bacik wrote:
> From: Josef Bacik
>
> We use this number to figure out how many delayed refs to run, but
> __btrfs_run_delayed_refs really only checks every time we need a new
> delayed ref head, so we always run at least one ref head completely no
> matter what the number of items on it. Fix the accounting to only be
> adjusted when we add/remove a ref head.

David, I think this also warrants a forward-looking sentence stating that
the number is also going to be used to calculate the required number of
bytes in the delayed refs rsv. Something along the lines of:

"In addition to using this number to limit the number of delayed refs
run, a future patch is also going to use it to calculate the amount of
space required for the delayed refs space reservation."

> Reviewed-by: Nikolay Borisov
> Signed-off-by: Josef Bacik
> ---
>  fs/btrfs/delayed-ref.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index b3e4c9fcb664..48725fa757a3 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -251,8 +251,6 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
>  	ref->in_tree = 0;
>  	btrfs_put_delayed_ref(ref);
>  	atomic_dec(&delayed_refs->num_entries);
> -	if (trans->delayed_ref_updates)
> -		trans->delayed_ref_updates--;
>  }
>
>  static bool merge_ref(struct btrfs_trans_handle *trans,
> @@ -467,7 +465,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
>  	if (ref->action == BTRFS_ADD_DELAYED_REF)
>  		list_add_tail(&ref->add_list, &href->ref_add_list);
>  	atomic_inc(&href->num_entries);
> -	trans->delayed_ref_updates++;
>  	spin_unlock(&href->lock);
>  	return ret;
>  }
>
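The accounting change can be pictured with a toy model: the counter that drives how many delayed refs get run should track ref heads, because __btrfs_run_delayed_refs always processes a head to completion regardless of how many individual refs hang off it. A minimal sketch (invented names, not the kernel structures):

```python
class Trans:
    def __init__(self):
        self.delayed_ref_updates = 0  # now counts ref heads only

class RefHead:
    def __init__(self, trans):
        self.refs = []
        trans.delayed_ref_updates += 1  # adjusted only when a head is added

def insert_delayed_ref(trans, head, ref):
    # Individual refs no longer bump trans.delayed_ref_updates.
    head.refs.append(ref)

trans = Trans()
head = RefHead(trans)
for r in ("ref-a", "ref-b", "ref-c"):
    insert_delayed_ref(trans, head, r)
# One head with three refs: the counter reflects the one unit of work
# actually scheduled, not the three items merged into it.
```

Before the patch, the counter would have read 3 here while the runner still did exactly one head's worth of work, which is the mismatch being fixed.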
Re: System unable to mount partition after a power loss
On 2018-12-07 01:43, Doni Crosby wrote:
>> This is qemu-kvm? What's the cache mode being used? It's possible the
>> usual write guarantees are thwarted by VM caching.
> Yes it is a proxmox host running the system so it is a qemu vm, I'm
> unsure on the caching situation.
On the note of QEMU and the cache mode, the only cache mode I've seen
actually cause issues for BTRFS volumes _inside_ a VM is 'cache=unsafe',
but that causes problems for most filesystems, so it's probably not the
issue here.

OTOH, I've seen issues with most of the cache modes other than
'cache=writeback' and 'cache=writethrough' when dealing with BTRFS as
the back-end storage on the host system, and most of the time such
issues will manifest as both problems with the volume inside the VM
_and_ the volume the disk images are being stored on.
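As a rough summary of how the common QEMU cache modes differ (based on QEMU's documented semantics; worth double-checking against the qemu(1) man page for your version), the property that matters for power-loss safety is whether guest flush requests are honored:

```python
# Sketch of QEMU -drive cache mode semantics: whether the host page
# cache is used, and whether guest flush (fdatasync) requests are honored.
cache_modes = {
    "writethrough": {"host_cache": True,  "flushes_honored": True},
    "writeback":    {"host_cache": True,  "flushes_honored": True},
    "none":         {"host_cache": False, "flushes_honored": True},
    "directsync":   {"host_cache": False, "flushes_honored": True},
    "unsafe":       {"host_cache": True,  "flushes_honored": False},
}

# 'unsafe' is the only mode that drops flushes entirely, which is why it
# can corrupt any journaling/CoW filesystem inside the guest on power loss.
risky = [m for m, p in cache_modes.items() if not p["flushes_honored"]]
```

This lines up with the observation above: only `cache=unsafe` breaks the write-ordering guarantees a filesystem inside the guest depends on; the other modes differ mainly in performance and host-cache behavior.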
Re: What if TRIM issued a wipe on devices that don't TRIM?
On 2018-12-06 23:09, Andrei Borzenkov wrote:
> 06.12.2018 16:04, Austin S. Hemmelgarn wrote:
>> * On SCSI devices, a discard operation translates to a SCSI UNMAP
>> command. As pointed out by Ronnie Sahlberg in his reply, this command
>> is purely advisory, may not result in any actual state change on the
>> target device, and is not guaranteed to wipe the data. To actually wipe
>> things, you have to explicitly write bogus data to the given regions
>> (using either regular writes, or a WRITESAME command with the desired
>> pattern), and _then_ call UNMAP on them.
> WRITE SAME command has UNMAP bit and depending on device and kernel
> version kernel may actually issue either UNMAP or WRITE SAME with UNMAP
> bit set when doing discard.
Good to know. I've not looked at the SCSI code much, and actually didn't
know about the UNMAP bit for the WRITE SAME command, so I just assumed
that the kernel only used the UNMAP command.
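The "advisory" nature of UNMAP is the crux: a device is free to ignore it, so a guaranteed wipe has to write the pattern first and only then discard. A toy model of the difference (a bytearray standing in for an LBA range; no real SCSI here, and `advisory_unmap` is an invented name):

```python
def advisory_unmap(dev, off, length, device_honors=False):
    # Like SCSI UNMAP: the target MAY deallocate (reads then return
    # zeroes), or it may do nothing at all -- both are spec-compliant.
    if device_honors:
        dev[off:off + length] = bytes(length)

def wipe_then_unmap(dev, off, length):
    # Guaranteed path: explicitly overwrite first, then issue the
    # advisory discard purely as an optimization hint.
    dev[off:off + length] = bytes(length)   # the actual wipe
    advisory_unmap(dev, off, length)        # advisory afterwards

secret = bytearray(b"top-secret-data!")
advisory_unmap(secret, 0, len(secret))      # may be a no-op
still_there = bytes(secret) == b"top-secret-data!"

secret2 = bytearray(b"top-secret-data!")
wipe_then_unmap(secret2, 0, len(secret2))   # data is provably gone
```

The first call can legally leave the secret in place; the second cannot, which is exactly the write-then-UNMAP ordering recommended above.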
Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs
On 7.12.18 9:09, Nikolay Borisov wrote:
>
>
> On 6.12.18 19:54, David Sterba wrote:
>> On Thu, Dec 06, 2018 at 06:52:21PM +0200, Nikolay Borisov wrote:
>>>
>>>
>>> On 3.12.18 17:20, Josef Bacik wrote:
>>>> Now with the delayed_refs_rsv we can now know exactly how much pending
>>>> delayed refs space we need. This means we can drastically simplify
>>>
>>> IMO it will be helpful if there is a sentence here referring back to
>>> btrfs_update_delayed_refs_rsv to put your first sentence into context.
>>> But I guess this is something David can also do.
>>
>> I'll update the changelog, but I'm not sure what exactly you want to see
>> there, please post the replacement text. Thanks.
>
> With the introduction of the delayed_refs_rsv infrastructure, namely
> btrfs_update_delayed_refs_rsv, we now know exactly how much pending
> delayed refs space is required.

To put things into context as to why I deem this change beneficial:
basically, doing the migration of the reservation from the transaction
to the delayed refs rsv modifies both size and reserved - they will be
equal. Calling btrfs_update_delayed_refs_rsv actually increases ->size
and doesn't really decrement ->reserved. Also, we never do
btrfs_block_rsv_migrate/use_block_rsv on the delayed refs block rsv, so
managing the ->reserved value for the delayed refs rsv is different than
for the rest of the block rsvs.

>
>>>> btrfs_check_space_for_delayed_refs by simply checking how much space we
>>>> have reserved for the global rsv (which acts as a spill over buffer) and
>>>> the delayed refs rsv. If our total size is beyond that amount then we
>>>> know it's time to commit the transaction and stop any more delayed refs
>>>> from being generated.
>>>>
>>>> Signed-off-by: Josef Bacik
>
Re: HELP unmountable partition after btrfs balance to RAID0
Thomas Mohr posted on Thu, 06 Dec 2018 12:31:15 +0100 as excerpted:

> We wanted to convert a file system to a RAID0 with two partitions.
> Unfortunately we had to reboot the server during the balance operation
> before it could complete.
>
> Now following happens:
>
> A mount attempt of the array fails with following error code:
>
> btrfs recover yields roughly 1.6 out of 4 TB.

[Just another btrfs user and list regular, not a dev. A dev may reply
to your specific case, but meanwhile, for next time...]

That shouldn't be a problem. Because with raid0 a failure of any of the
components will take down the entire raid, making it less reliable than
a single device, raid0 (in general, not just btrfs) is considered only
useful for data of low enough value that its loss is no big deal, either
because it's truly of little value (internet cache being a good
example), or because backups are kept available and updated for whenever
the raid0 array fails. Because with raid0, it's always a question of
when it'll fail, not if.

So loss of a filesystem being converted to raid0 isn't a problem,
because the data on it, by virtue of being in the process of conversion
to raid0, is defined as of throw-away value in any case. If it's of
higher value than that, it's not going to be raid0 (or in the process of
conversion to it) in the first place.

Of course that's simply an extension of the more general first
sysadmin's rule of backups, that the true value of data is defined not
by arbitrary claims, but by the number of backups of that data it's
worth having. Because "things happen", whether it's fat-fingering, bad
hardware, buggy software, or simply someone tripping over the power
cable or running into the power pole outside at the wrong time.

So no backup is simply defining the data as worth less than the
time/trouble/resources necessary to make that backup.

Note that you ALWAYS save what was of most value to you: either the
time/trouble/resources to do the backup, if your actions defined that to
be of more value than the data, or the data, if you had that backup,
thereby defining the value of the data to be worth backing up.

Similarly, failure of the only backup isn't a problem because by virtue
of there being only that one backup, the data is defined as not worth
having more than one, and likewise, having an outdated backup isn't a
problem, because that's simply the special case of defining the data in
the delta between the backup time and the present as not (yet) worth the
time/hassle/resources to make/refresh that backup.

(And FWIW, the second sysadmin's rule of backups is that it's not a
backup until you've successfully tested it recoverable in the same sort
of conditions you're likely to need to recover it in. Because so many
people have /thought/ they had backups, that turned out not to be,
because they never tested that they could actually recover the data from
them. For instance, if the backup tools you'll need to recover the
backup are on the backup itself, how do you get to them? Can you create
a filesystem for the new copy of the data and recover it from the backup
with just the tools and documentation available from your emergency boot
media? Untested backup == no backup, or at best, backup still in
process!)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: System unable to mount partition after a power loss
On 2018/12/7 1:24 PM, Doni Crosby wrote:
> All,
>
> I'm coming to you to see if there is a way to fix or at least recover
> most of the data I have from a btrfs filesystem. The system went down
> after both a breaker and the battery backup failed. I cannot currently
> mount the system, with the following error from dmesg:
>
> Note: The vda1 is just the entire disk being passed from the VM host
> to the VM it's not an actual true virtual block device
>
> [ 499.704398] BTRFS info (device vda1): disk space caching is enabled
> [ 499.704401] BTRFS info (device vda1): has skinny extents
> [ 499.739522] BTRFS error (device vda1): parent transid verify failed
> on 3563231428608 wanted 5184691 found 5183327

Transid mismatch normally means the fs is screwed up more or less.

And according to your mount failure, it looks like the fs got screwed up
badly.

What's the kernel version used in the VM? I don't really think the VM is
always using the latest kernel.

> [ 499.740257] BTRFS error (device vda1): parent transid verify failed
> on 3563231428608 wanted 5184691 found 5183327
> [ 499.770847] BTRFS error (device vda1): open_ctree failed
>
> I have tried running btrfsck:
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
> parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
> parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
> parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
> parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
> parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
> parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689

According to your later dump-super output, it looks pretty possible that
the corrupted extents all belong to the extent tree.

So it's still possible that your fs tree and other essential trees are
OK.

Please dump the following output (with its stderr) to further confirm
the damage:

# btrfs ins dump-tree -b 31801344 --follow /dev/vda1

If your objective is only to recover data, then you could start to try
btrfs-restore. It's pretty hard to fix the heavily damaged extent tree.

Thanks,
Qu

> Ignoring transid failure
> Checking filesystem on /dev/vda1
> UUID: 7c76bb05-b3dc-4804-bf56-88d010a214c6
> checking extents
> parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
> parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
> parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
> parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
> parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
> parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> Ignoring transid failure
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> Ignoring transid failure
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> Ignoring transid failure
> parent transid verify
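When staring at walls of these messages, a quick way to triage is to extract how far behind each block is: a found generation thousands of transactions behind wanted (5183327 vs 5184691) points at a much staler node than the off-by-2 or off-by-3 cases. A small parser sketch over lines in this exact format:

```python
import re

PAT = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)")

def summarize(log):
    """Map block bytenr -> generation lag (wanted - found), deduplicating
    the repeated retries of the same block."""
    lag = {}
    for m in PAT.finditer(log):
        bytenr, wanted, found = map(int, m.groups())
        lag[bytenr] = wanted - found
    return lag

# Two of the blocks reported above, repeated as in the real output.
sample = """\
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
"""
lags = summarize(sample)
```

On the sample, the first block is only 3 generations behind while the second is 1364 behind, which is the kind of spread that distinguishes a near-miss commit from a long-stale tree node.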
Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs
On 6.12.18 19:54, David Sterba wrote:
> On Thu, Dec 06, 2018 at 06:52:21PM +0200, Nikolay Borisov wrote:
>>
>>
>> On 3.12.18 17:20, Josef Bacik wrote:
>>> Now with the delayed_refs_rsv we can now know exactly how much pending
>>> delayed refs space we need. This means we can drastically simplify
>>
>> IMO it will be helpful if there is a sentence here referring back to
>> btrfs_update_delayed_refs_rsv to put your first sentence into context.
>> But I guess this is something David can also do.
>
> I'll update the changelog, but I'm not sure what exactly you want to see
> there, please post the replacement text. Thanks.

With the introduction of the delayed_refs_rsv infrastructure, namely
btrfs_update_delayed_refs_rsv, we now know exactly how much pending
delayed refs space is required.

>
>>> btrfs_check_space_for_delayed_refs by simply checking how much space we
>>> have reserved for the global rsv (which acts as a spill over buffer) and
>>> the delayed refs rsv. If our total size is beyond that amount then we
>>> know it's time to commit the transaction and stop any more delayed refs
>>> from being generated.
>>>
>>> Signed-off-by: Josef Bacik
>
Re: System unable to mount partition after a power loss
> This is qemu-kvm? What's the cache mode being used? It's possible the
> usual write guarantees are thwarted by VM caching.

Yes it is a proxmox host running the system so it is a qemu vm, I'm
unsure on the caching situation.

> Old version of progs, I suggest upgrading to 4.17.1 and run

I updated the progs to 4.17 and ran the following:

btrfs insp dump-s -f /device/:

See attachment.

btrfs rescue super -v /device/ (insp rescue super wasn't valid):

All Devices:
	Device: id = 1, name = /dev/vda1

Before Recovering:
	[All good supers]:
		device name = /dev/vda1
		superblock bytenr = 65536

		device name = /dev/vda1
		superblock bytenr = 67108864

		device name = /dev/vda1
		superblock bytenr = 274877906944

	[All bad supers]:

All supers are valid, no need to recover

btrfs check --mode=lowmem /dev/vda1:

parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=3563210342400 item=120 parent level=1 child level=1
ERROR: cannot open file system

mount -o ro,norecovery,usebackuproot /dev/vda1 /mnt:

Same dmesg output as before.

On Fri, Dec 7, 2018 at 12:56 AM Chris Murphy wrote:
>
> On Thu, Dec 6, 2018 at 10:24 PM Doni Crosby wrote:
> >
> > All,
> >
> > I'm coming to you to see if there is a way to fix or at least recover
> > most of the data I have from a btrfs filesystem. The system went down
> > after both a breaker and the battery backup failed. I cannot currently
> > mount the system, with the following error from dmesg:
> >
> > Note: The vda1 is just the entire disk being passed from the VM host
> > to the VM it's not an actual true virtual block device
>
> This is qemu-kvm? What's the cache mode being used? It's possible the
> usual write guarantees are thwarted by VM caching.
>
> > btrfs check --recover also ends in a segmentation fault
>
> I'm not familiar with --recover option, the --repair option is flagged
> with a warning in the man page.
>
>        Warning
>            Do not use --repair unless you are advised to do so by a
>            developer or an experienced user,
>
> > btrfs --version:
> > btrfs-progs v4.7.3
>
> Old version of progs, I suggest upgrading to 4.17.1 and run
>
> btrfs insp dump-s -f /device/
> btrfs insp rescue super -v /device/
> btrfs check --mode=lowmem /device/
>
> These are all read only commands. Please post output to the list,
> hopefully a developer will get around to looking at it.
>
> It is safe to try:
>
> mount -o ro,norecovery,usebackuproot /device/ /mnt/
>
> If that works, I suggest updating your backup while it's still
> possible in the meantime.
>
> --
> Chris Murphy

superblock: bytenr=65536, device=/dev/vda1
---------------------------------------------------------
csum_type		0 (crc32c)
csum_size		4
csum			0xbfa6fd72 [match]
bytenr			65536
flags			0x1
			( WRITTEN )
magic			_BHRfS_M [match]
fsid			7c76bb05-b3dc-4804-bf56-88d010a214c6
label			Array
generation		5184693
root			31801344
sys_array_size		226
chunk_root_generation	5183734
root_level		1
chunk_root		20971520
chunk_root_level	1
log_root		0
log_root_transid	0
log_root_level		0
total_bytes		32003947737088
bytes_used		6652776640512
sectorsize		4096
nodesize		16384
leafsize (deprecated)	16384
stripesize		4096
root_dir		6
num_devices		1
compat_flags		0x0
compat_ro_flags		0x0
incompat_flags		0x161
			( MIXED_BACKREF |
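The three "good supers" reported by the rescue run line up with btrfs's fixed superblock locations: the primary copy at 64KiB with mirrors at 64MiB and 256GiB, each mirror only present if the device is large enough to hold it. A quick sanity check of those offsets:

```python
# btrfs superblock offsets: primary at 64KiB, mirrors at 64MiB and 256GiB.
SUPER_OFFSETS = [64 * 1024, 64 * 1024 ** 2, 256 * 1024 ** 3]

# The three bytenrs from the "All good supers" output above.
reported = [65536, 67108864, 274877906944]
assert SUPER_OFFSETS == reported

def super_mirrors_within(dev_size, super_size=4096):
    """Which superblock copies actually fit on a device of this size?"""
    return [off for off in SUPER_OFFSETS if off + super_size <= dev_size]
```

On this 32TB filesystem (total_bytes 32003947737088 in the dump) all three copies fit, which is why "All supers are valid" is consistent with the dump-super output.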
Re: System unable to mount partition after a power loss
On Thu, Dec 6, 2018 at 10:24 PM Doni Crosby wrote:
>
> All,
>
> I'm coming to you to see if there is a way to fix or at least recover
> most of the data I have from a btrfs filesystem. The system went down
> after both a breaker and the battery backup failed. I cannot currently
> mount the system, with the following error from dmesg:
>
> Note: The vda1 is just the entire disk being passed from the VM host
> to the VM it's not an actual true virtual block device

This is qemu-kvm? What's the cache mode being used? It's possible the
usual write guarantees are thwarted by VM caching.

> btrfs check --recover also ends in a segmentation fault

I'm not familiar with --recover option, the --repair option is flagged
with a warning in the man page.

       Warning
           Do not use --repair unless you are advised to do so by a
           developer or an experienced user,

> btrfs --version:
> btrfs-progs v4.7.3

Old version of progs, I suggest upgrading to 4.17.1 and run

btrfs insp dump-s -f /device/
btrfs insp rescue super -v /device/
btrfs check --mode=lowmem /device/

These are all read only commands. Please post output to the list,
hopefully a developer will get around to looking at it.

It is safe to try:

mount -o ro,norecovery,usebackuproot /device/ /mnt/

If that works, I suggest updating your backup while it's still
possible in the meantime.

-- 
Chris Murphy
System unable to mount partition after a power loss
All,

I'm coming to you to see if there is a way to fix or at least recover
most of the data I have from a btrfs filesystem. The system went down
after both a breaker and the battery backup failed. I cannot currently
mount the system, with the following error from dmesg:

Note: The vda1 is just the entire disk being passed from the VM host
to the VM; it's not an actual true virtual block device.

[ 499.704398] BTRFS info (device vda1): disk space caching is enabled
[ 499.704401] BTRFS info (device vda1): has skinny extents
[ 499.739522] BTRFS error (device vda1): parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
[ 499.740257] BTRFS error (device vda1): parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
[ 499.770847] BTRFS error (device vda1): open_ctree failed

I have tried running btrfsck:

parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
Ignoring transid failure
Checking filesystem on /dev/vda1
UUID: 7c76bb05-b3dc-4804-bf56-88d010a214c6
checking extents
parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
Ignoring transid failure
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
Ignoring transid failure
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
Ignoring transid failure
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
Ignoring transid failure
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
Ignoring transid failure
Segmentation fault

btrfs check --recover also ends in a segmentation fault.

I am aware of chunk-recover and have tried to run it but grew wary when
I saw dev0 not vda1.

Any help would be appreciated,
Doni

uname -a: Linux Homophone 4.18.0-0.bpo.1-amd64 #1 SMP Debian 4.18.6-1~bpo9+1 (2018-09-13)
Re: What if TRIM issued a wipe on devices that don't TRIM?
06.12.2018 16:04, Austin S. Hemmelgarn wrote:
>
> * On SCSI devices, a discard operation translates to a SCSI UNMAP
> command. As pointed out by Ronnie Sahlberg in his reply, this command
> is purely advisory, may not result in any actual state change on the
> target device, and is not guaranteed to wipe the data. To actually wipe
> things, you have to explicitly write bogus data to the given regions
> (using either regular writes, or a WRITESAME command with the desired
> pattern), and _then_ call UNMAP on them.

The WRITE SAME command has an UNMAP bit, and depending on the device and
kernel version the kernel may actually issue either UNMAP or WRITE SAME
with the UNMAP bit set when doing a discard.
Re: [PATCH RESEND 0/8] btrfs-progs: sub: Relax the privileges of "subvolume list/show"
On Tue, Nov 27, 2018 at 02:24:41PM +0900, Misono Tomohiro wrote:
> Hello,
>
> This is basically the resend of
> "[PATCH v2 00/20] btrfs-progs: Rework of "subvolume list/show" and relax the root privileges of them" [1]
> which I submitted in June. The aim of this series is to allow a
> non-privileged user to use basic subvolume functionality
> (create/list/snapshot/delete; this allows "list").
>
> They were once in the devel branch with some whitespace/comment
> modification by David. I rebased them to the current devel branch.
>
> github: https://github.com/t-msn/btrfs-progs/tree/rework-sub-list
>
> Basic logic/code is the same as before. Some differences are:
> - Use latest libbtrfsutil from Omar [2] (thus drop first part of patches).
>   As a result, "sub list" cannot accept an ordinary directory to be
>   specified (which is allowed in the previous version)
> - Drop patches which add new options to "sub list"
> - Use 'nobody' as the non-privileged test user just like the libbtrfsutil test
> - Update comments
>
> Importantly, in order to make output consistent for both root and
> non-privileged users, this changes the behavior of "subvolume list":
> - (default) Only list subvolumes under the specified path.
>   The path needs to be a subvolume.
> - The (-a) filter is dropped, i.e. its output is the same as the
>   default behavior of "sub list" in progs <= 4.19
>
> Therefore, existing scripts may need to be updated to add the -a option
> (I believe nobody uses the current -a option).
> If anyone thinks this is not good, please let me know.

I think there are a few options in the case that the path isn't a
subvolume:

1. List all subvolumes in the filesystem with randomly mangled paths,
   which is what we currently do.
2. Error out, which is what this version of the series does.
3. List all subvolumes under the containing subvolume, which is what
   the previous version does.
4. List all subvolumes under the containing subvolume that are
   underneath the given path.

Option 1 won't work well for unprivileged users.
Option 2 (this series) is definitely going to break people's
workflows/scripts. Option 3 is unintuitive. In my opinion, option 4 is
the nicest, but it may also break scripts that expect all subvolumes to
be printed.

There's also an option 5, which is to keep the behavior the same for
root (like what my previous patch [1] did) and implement option 4 for
unprivileged users.

I think 4 and 5 are the two main choices: do we want to preserve
backwards compatibility as carefully as possible (at the cost of
consistency), or do we want to risk it and improve the interface?

1: https://github.com/osandov/btrfs-progs/commit/fb61c21aeb998b12c1d02532639083d7f40c41e0
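Option 4 ("subvolumes underneath the given path") can be pictured as a pure path filter. A toy sketch over absolute subvolume paths, purely illustrative of the proposed semantics (no btrfs or libbtrfsutil calls; the function name is invented):

```python
def list_subvolumes_option4(all_subvols, path):
    """Option 4 (toy model): subvolumes at or below the given path."""
    base = path.rstrip("/")
    prefix = base + "/"
    return sorted(s for s in all_subvols if s == base or s.startswith(prefix))

subvols = ["/mnt/a", "/mnt/a/nested", "/mnt/b", "/mnt/other/c"]
under_a = list_subvolumes_option4(subvols, "/mnt/a")
```

Here `under_a` excludes `/mnt/b` and `/mnt/other/c`, which is exactly the behavior that could surprise scripts expecting the full filesystem-wide listing (the compatibility concern raised for options 4 and 5).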
[PATCH] libbtrfsutil: fix unprivileged tests if kernel lacks support
From: Omar Sandoval

I apparently didn't test this on a pre-4.18 kernel.
test_subvolume_info_unprivileged() checks for an ENOTTY, but this
doesn't seem to work correctly with subTest().
test_subvolume_iterator_unprivileged() doesn't have a check at all. Add
an explicit check to both before doing the actual test.

Signed-off-by: Omar Sandoval
---
Based on the devel branch.

 libbtrfsutil/python/tests/test_subvolume.py | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/libbtrfsutil/python/tests/test_subvolume.py b/libbtrfsutil/python/tests/test_subvolume.py
index 99ec97bc..b06a1d3d 100644
--- a/libbtrfsutil/python/tests/test_subvolume.py
+++ b/libbtrfsutil/python/tests/test_subvolume.py
@@ -168,12 +168,13 @@ class TestSubvolume(BtrfsTestCase):
 
         with drop_privs():
             try:
-                self._test_subvolume_info(subvol, snapshot)
+                btrfsutil.subvolume_info(self.mountpoint)
             except OSError as e:
                 if e.errno == errno.ENOTTY:
                     self.skipTest('BTRFS_IOC_GET_SUBVOL_INFO is not available')
                 else:
                     raise
+            self._test_subvolume_info(subvol, snapshot)
 
     def test_read_only(self):
         for arg in self.path_or_fd(self.mountpoint):
@@ -487,6 +488,13 @@ class TestSubvolume(BtrfsTestCase):
         try:
             os.chdir(self.mountpoint)
             with drop_privs():
+                try:
+                    list(btrfsutil.SubvolumeIterator('.'))
+                except OSError as e:
+                    if e.errno == errno.ENOTTY:
+                        self.skipTest('BTRFS_IOC_GET_SUBVOL_ROOTREF is not available')
+                    else:
+                        raise
                 self._test_subvolume_iterator()
         finally:
             os.chdir(pwd)
-- 
2.19.2
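The fix follows a general pattern worth naming: make one cheap probe call first, translate a "not supported" errno into a test skip, and only then run the real test body, so a skip can never mask a genuine failure inside the body. A standalone sketch of the pattern (generic and self-contained, not the libbtrfsutil API; `SkipTest` and `run_with_support_probe` are invented names):

```python
import errno

class SkipTest(Exception):
    """Stand-in for unittest's skip mechanism."""

def run_with_support_probe(probe, test_body):
    # Probe first: ENOTTY from the probe means the kernel lacks the
    # ioctl, so skip. Any error from the body itself still propagates.
    try:
        probe()
    except OSError as e:
        if e.errno == errno.ENOTTY:
            raise SkipTest("ioctl not available")
        raise
    return test_body()

def old_kernel_probe():
    # Simulates a pre-4.18 kernel rejecting the unprivileged ioctl.
    raise OSError(errno.ENOTTY, "Inappropriate ioctl for device")

try:
    run_with_support_probe(old_kernel_probe, lambda: "ran")
    outcome = "ran"
except SkipTest:
    outcome = "skipped"
```

On a supporting kernel the probe succeeds and the body's own assertions run normally, which is why the patch moves `_test_subvolume_info()` out of the `try` block instead of leaving it as the probe itself.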
Re: BTRFS RAID filesystem unmountable
On 2018/12/7 7:15 AM, Michael Wade wrote: > Hi Qu, > > Me again! Having formatted the drives and rebuilt the RAID array I > seem to be having the same problem as before (no power cut this > time [I bought a UPS]). But strangely, your super block shows it has a log tree, which means either you hit a kernel panic/transaction abort, or an unexpected power loss. > The btrfs volume is broken on my ReadyNAS. > > I have attached the results of some of the commands you asked me to > run last time, and I am hoping you might be able to help me out. This time, the problem is more serious: some chunk tree blocks are not even inside the system chunk range, no wonder it fails to mount. To confirm it, you could run "btrfs ins dump-tree -b 17725903077376 " and paste the output. But I don't have any clue. My guess is some kernel problem related to new chunk allocation, or the chunk root node itself is already seriously corrupted. Considering how old your kernel is (4.4), it's not recommended to use btrfs on such an old kernel, unless it's well backported with tons of btrfs fixes. Thanks, Qu > > Kind regards > Michael > On Sat, 19 May 2018 at 12:43, Michael Wade wrote: >> >> I have let the find-root command run for 14+ days; it's produced a >> pretty huge log file (1.6 GB) but still hasn't completed. I think I will >> start the process of reformatting my drives and starting over. >> >> Thanks for your help anyway. >> >> Kind regards >> Michael >> >> On 5 May 2018 at 01:43, Qu Wenruo wrote: >>> >>> >>> On 2018-05-05 00:18, Michael Wade wrote: Hi Qu, The tool is still running and the log file is now ~300 MB. I guess it shouldn't normally take this long.. Is there anything else worth trying? >>> >>> I'm afraid not much. >>> >>> Although there is a possibility to modify btrfs-find-root to do a much >>> faster but limited search. >>> >>> But from the result, it looks like underlying device corruption, and there's not >>> much we can do right now. 
>>> >>> Thanks, >>> Qu >>> Kind regards Michael On 2 May 2018 at 06:29, Michael Wade wrote: > Thanks Qu, > > I actually aborted the run with the old btrfs tools once I saw its > output. The new btrfs tools are still running and have produced a log > file of ~85 MB filled with that content so far. > > Kind regards > Michael > > On 2 May 2018 at 02:31, Qu Wenruo wrote: >> >> >> On 2018-05-01 23:50, Michael Wade wrote: >>> Hi Qu, >>> >>> Oh dear, that is not good news! >>> >>> I have been running the find-root command since yesterday but it only >>> seems to be outputting the following message: >>> >>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096 >> >> It's mostly fine, as find-root will go through all tree blocks and try >> to read them as tree blocks. >> Although btrfs-find-root suppresses csum error output, such basic >> tree validation checks are not suppressed, thus you get such messages. >> >>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096 >>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096 >>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096 >>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096 >>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096 >>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096 >>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096 >>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096 >>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096 >>> >>> I tried with the latest btrfs tools compiled from source and the ones >>> I have installed, with the same result. Is there a CLI utility I could >>> use to determine if the log contains any other content? >> >> Did it report any useful info at the end? 
>> >> Thanks, >> Qu >> >>> >>> Kind regards >>> Michael >>> >>> >>> On 30 April 2018 at 04:02, Qu Wenruo wrote: On 2018-04-29 22:08, Michael Wade wrote: > Hi Qu, > > Got this error message: > > ./btrfs inspect dump-tree -b 20800943685632 /dev/md127 > btrfs-progs v4.16.1 > bytenr mismatch, want=20800943685632, have=3118598835113619663 > ERROR: cannot read chunk root > ERROR: unable to open /dev/md127 > > I have attached the dumps for: > > dd if=/dev/md127 of=/tmp/chunk_root.copy1 bs=1 count=32K > skip=266325721088 > dd if=/dev/md127 of=/tmp/chunk_root.copy2 bs=1 count=32K > skip=266359275520 Unfortunately, both dumps are corrupted and contain mostly garbage. I think the underlying stack (mdraid) has something wrong or failed to recover its data. This means your last
Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead
On 2018/12/7 3:35 AM, David Sterba wrote: > On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote: >> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote: >>> This patchset can be fetched from github: >>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased >>> >>> Which is based on v4.20-rc1. >> >> Thanks, I'll add it to for-next soon. > > The branch was there for some time but not for at least a week (my > mistake, I did not notice in time). I've rebased it on top of recent > misc-next, but without the delayed refs patchset from Josef. > > At the moment I'm considering it for merge to 4.21, there's still some > time to pull it out in case it shows up to be too problematic. I'm > mostly worried about the unknown interactions with the enospc updates or For that part, I don't think it would have any obvious problems for the enospc updates, as the user-noticeable effect is the delay of reloc tree deletion. Despite that, it's mostly transparent to extent allocation. > generally because of lack of qgroup and reloc code reviews. That's the biggest problem. However, most of the current qgroup + balance optimization is done inside qgroup code (to skip certain qgroup records), so if we're going to hit some problem then this patchset would have the highest possibility of hitting it. Later patches will mostly just keep tweaking qgroup without affecting any other parts. So I'm fine if you decide to pull it out for now. Thanks, Qu > > I'm going to do some testing of the rebased branch before I add it to > for-next. The branch is ext/qu/qgroup-delay-scan in my devel repos, > please check if everything is still ok there. Thanks.
Re: [RFC PATCH 3/3] coccinelle: api: add offset_in_page.cocci
On Thu, 6 Dec 2018, Johannes Thumshirn wrote: > On 05/12/2018 15:46, Julia Lawall wrote: > [...] > >> +@r_patch depends on !context && patch && !org && !report@ > >> +expression E; > >> +type T; > >> +@@ > >> + > >> +( > >> +- E & ~PAGE_MASK > >> ++ offset_in_page(E) > >> +| > >> +- E & (PAGE_SIZE - 1) > >> ++ offset_in_page(E) > > > > The two lines above should be subsumed by the two lines below. When there > > is a type metavariable that has no other dependencies, an isomorphism will > > consider that it is either present or absent. > > Oh OK, I'm sorry I'm not really into coccinelle so I guessed it might > take some iterations. > > Do you have an example for this? Explanation 1: Coccinelle has a file standard.iso that shows the isomorphisms (rewrite rules) that may be applied to semantic patches. One of the rules is: Expression @ not_ptr1 @ expression *X; @@ !X => X == NULL So if you have a pointer-typed expression X and you write a transformation on !X, it will also apply to occurrences of X == NULL in the source code. In this way, you don't have to write so many variants. Likewise there is an isomorphism: Expression @ drop_cast @ expression E; pure type T; @@ (T)E => E That is, if you have a semantic patch with (T)X, then it will also apply to code that matches just X, without the cast. The word pure means that this isomorphism metavariable has to match a semantic patch term that is a metavariable and this metavariable can't be used elsewhere. If you wrote - (char)x then you would probably not want that to apply without the (char) cast. But if you have just - (T)x for some random unbound metavariable T, then perhaps you don't care about the cast to T. If you actually do, then you can put disable drop_cast in the header of your rule. 
Explanation 2: To see what your semantic patch is really doing, you can run spatch --parse-cocci sp.cocci Here is what I get for your patch rule, with some annotations added: @@ expression E; type T; @@ ( -E >>> offset_in_page(E) -& -~-PAGE_MASK | -~ >>> offset_in_page(E) -PAGE_MASK -& -E | // the following come from // - E & (PAGE_SIZE - 1) // + offset_in_page(E) -E // 1 >>> offset_in_page(E) -& -(-PAGE_SIZE -- -1-) | -E // 2 >>> offset_in_page(E) -& -PAGE_SIZE -- -1 | -( // 3 >>> offset_in_page(E) -PAGE_SIZE -- -1-) -& -E | -PAGE_SIZE // 4 >>> offset_in_page(E) -- -1 -& -E | // the following come from: // - E & ((T)PAGE_SIZE - 1) // + offset_in_page(E) -E >>> offset_in_page(E) -& -(-(-T -)-PAGE_SIZE -- -1-) | -E // same as 1 >>> offset_in_page(E) -& -(-PAGE_SIZE -- -1-) | -E >>> offset_in_page(E) -& -(-T -)-PAGE_SIZE -- -1 | -E // same as 2 >>> offset_in_page(E) -& -PAGE_SIZE -- -1 | -( >>> offset_in_page(E) -(-T -)-PAGE_SIZE -- -1-) -& -E | -( // same as 3 >>> offset_in_page(E) -PAGE_SIZE -- -1-) -& -E | -( >>> offset_in_page(E) -T -)-PAGE_SIZE -- -1 -& -E | -PAGE_SIZE // same as 4 >>> offset_in_page(E) -- -1 -& -E ) So all the transformation generated by - E & (PAGE_SIZE - 1) + offset_in_page(E) are also generated by - E & ((T)PAGE_SIZE - 1) + offset_in_page(E) I hope that is helpful. julia
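As a sanity check of what all these variants compute: `E & ~PAGE_MASK` and `E & (PAGE_SIZE - 1)` are the same bit operation, which is why the isomorphism can fold them into one pattern. A quick numeric check (assuming the common 4 KiB page size; this mirrors the kernel macros, it is not kernel code):

```python
PAGE_SIZE = 4096                 # assumed 4 KiB pages
PAGE_MASK = ~(PAGE_SIZE - 1)     # mask of the page-number bits, as in the kernel

def offset_in_page(addr):
    # What the kernel macro computes: the byte offset within the page.
    return addr & ~PAGE_MASK

# All three spellings agree for any address.
for addr in (0, 1, 4095, 4096, 123456789):
    assert addr & ~PAGE_MASK == addr & (PAGE_SIZE - 1) == addr % PAGE_SIZE

print(offset_in_page(123456789))  # 3349, i.e. 123456789 % 4096
```

Since `PAGE_SIZE` is a power of two, masking with `PAGE_SIZE - 1` keeps exactly the low-order bits, which is the in-page offset.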
Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead
On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote: > On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote: > > This patchset can be fetched from github: > > https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased > > > > Which is based on v4.20-rc1. > > Thanks, I'll add it to for-next soon. The branch was there for some time but not for at least a week (my mistake, I did not notice in time). I've rebased it on top of recent misc-next, but without the delayed refs patchset from Josef. At the moment I'm considering it for merge to 4.21, there's still some time to pull it out in case it shows up to be too problematic. I'm mostly worried about the unknown interactions with the enospc updates or generally because of lack of qgroup and reloc code reviews. I'm going to do some testing of the rebased branch before I add it to for-next. The branch is ext/qu/qgroup-delay-scan in my devel repos, please check if everything is still ok there. Thanks.
Re: [PATCH][v2] btrfs: run delayed items before dropping the snapshot
On Wed, Dec 05, 2018 at 12:12:21PM -0500, Josef Bacik wrote: > From: Josef Bacik > > With my delayed refs patches in place we started seeing a large number > of aborts in __btrfs_free_extent > > BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root > 35964 owner 1 offset 0 > Call Trace: > ? btrfs_merge_delayed_refs+0xaf/0x340 > __btrfs_run_delayed_refs+0x6ea/0xfc0 > ? btrfs_set_path_blocking+0x31/0x60 > btrfs_run_delayed_refs+0xeb/0x180 > btrfs_commit_transaction+0x179/0x7f0 > ? btrfs_check_space_for_delayed_refs+0x30/0x50 > ? should_end_transaction.isra.19+0xe/0x40 > btrfs_drop_snapshot+0x41c/0x7c0 > btrfs_clean_one_deleted_snapshot+0xb5/0xd0 > cleaner_kthread+0xf6/0x120 > kthread+0xf8/0x130 > ? btree_invalidatepage+0x90/0x90 > ? kthread_bind+0x10/0x10 > ret_from_fork+0x35/0x40 > > This was because btrfs_drop_snapshot depends on the root not being modified > while it's dropping the snapshot. It will unlock the root node (and really > every node) as it walks down the tree, only to re-lock it when it needs to do > something. This is a problem because if we modify the tree we could cow a > block > in our path, which frees our reference to that block. Then once we get back > to > that shared block we'll free our reference to it again, and get ENOENT when > trying to look up our extent reference to that block in __btrfs_free_extent. > > This is ultimately happening because we have delayed items left to be > processed > for our deleted snapshot _after_ all of the inodes are closed for the > snapshot. > We only run the delayed inode item if we're deleting the inode, and even then > we > do not run the delayed insertions or delayed removals. These can be run at > any > point after our final inode does its last iput, which is what triggers the > snapshot deletion. We can end up with the snapshot deletion happening and > then > have the delayed items run on that file system, resulting in the above > problem. 
> > This problem has existed forever, however my patches made it much easier to > hit > as I wake up the cleaner much more often to deal with delayed iputs, which > made > us more likely to start the snapshot dropping work before the transaction > commits, which is when the delayed items would generally be run. Before, > generally speaking, we would run the delayed items, commit the transaction, > and > wakeup the cleaner thread to start deleting snapshots, which means we were > less > likely to hit this problem. You could still hit it if you had multiple > snapshots to be deleted and ended up with lots of delayed items, but it was > definitely harder. > > Fix for now by simply running all the delayed items before starting to drop > the > snapshot. We could make this smarter in the future by making the delayed > items > per-root, and then simply drop any delayed items for roots that we are going > to > delete. But for now just a quick and easy solution is the safest. > > Cc: sta...@vger.kernel.org > Signed-off-by: Josef Bacik > --- > v1->v2: > - check for errors from btrfs_run_delayed_items. > - Dave I can reroll the series, but the second version of patch 1 is the same, > let me know what you want. As this is a small update it's fine to send just that patch. You may also use --in-reply-to so it threads to the original series. Resending series makes most sense (to me) when there's a discussion and many changes, so a fresh series makes it clear what's the current status. Patch replaced in for-next topic branch, thanks.
Re: [PATCH 1/2] btrfs: catch cow on deleting snapshots
On Fri, Nov 30, 2018 at 12:19:18PM -0500, Josef Bacik wrote: > On Fri, Nov 30, 2018 at 05:14:54PM +, Filipe Manana wrote: > > On Fri, Nov 30, 2018 at 4:53 PM Josef Bacik wrote: > > > > > > From: Josef Bacik > > > > > > When debugging some weird extent reference bug I suspected that we were > > > changing a snapshot while we were deleting it, which could explain my > > > bug. This was indeed what was happening, and this patch helped me > > > verify my theory. It is never correct to modify the snapshot once it's > > > being deleted, so mark the root when we are deleting it and make sure we > > > complain about it when it happens. > > > > > > Signed-off-by: Josef Bacik > > > --- > > > fs/btrfs/ctree.c | 3 +++ > > > fs/btrfs/ctree.h | 1 + > > > fs/btrfs/extent-tree.c | 9 + > > > 3 files changed, 13 insertions(+) > > > > > > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c > > > index 5912a97b07a6..5f82f86085e8 100644 > > > --- a/fs/btrfs/ctree.c > > > +++ b/fs/btrfs/ctree.c > > > @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct > > > btrfs_trans_handle *trans, > > > u64 search_start; > > > int ret; > > > > > > + if (test_bit(BTRFS_ROOT_DELETING, >state)) > > > + WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats > > > being dropped\n"); > > > > Please use btrfs_warn(), it makes sure we use a consistent message > > style, identifies the fs, etc. > > Also, "thats" should be "that is" or "that's". > > > > Ah yeah, I was following the other convention in there but we should probably > convert all of those to btrfs_warn. I'll fix the grammar thing as well, just > a > leftover from the much less code of conduct friendly message I originally had > there. Thanks, Committed with the following fixup: - WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being dropped\n"); + btrfs_error(fs_info, + "COW'ing blocks on a fs root that's being dropped");
Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs
On Thu, Dec 06, 2018 at 06:52:21PM +0200, Nikolay Borisov wrote: > > > On 3.12.18 г. 17:20 ч., Josef Bacik wrote: > > Now with the delayed_refs_rsv we can now know exactly how much pending > > delayed refs space we need. This means we can drastically simplify > > IMO it will be helpful if there is a sentence here referring back to > btrfs_update_delayed_refs_rsv to put your first sentence into context. > But I guess this is something David can also do. I'll update the changelog, but I'm not sure what exactly you want to see there, please post the replacement text. Thanks. > > btrfs_check_space_for_delayed_refs by simply checking how much space we > > have reserved for the global rsv (which acts as a spill over buffer) and > > the delayed refs rsv. If our total size is beyond that amount then we > > know it's time to commit the transaction and stop any more delayed refs > > from being generated. > > > > Signed-off-by: Josef Bacik
Re: [RFC PATCH 3/3] coccinelle: api: add offset_in_page.cocci
On 05/12/2018 15:46, Julia Lawall wrote: [...] >> +@r_patch depends on !context && patch && !org && !report@ >> +expression E; >> +type T; >> +@@ >> + >> +( >> +- E & ~PAGE_MASK >> ++ offset_in_page(E) >> +| >> +- E & (PAGE_SIZE - 1) >> ++ offset_in_page(E) > > The two lines above should be subsumed by the two lines below. When there > is a type metavariable that has no other dependencies, an isomorphism will > consider that it is either present or absent. Oh OK, I'm sorry I'm not really into coccinelle so I guessed it might take some iterations. Do you have an example for this? > Why not include the cast case for the context and org cases? This is an oversight actually as I couldn't test it due to a bug in openSUSE's coccinelle rpm. Thanks, Johannes -- Johannes Thumshirn, SUSE Labs Filesystems jthumsh...@suse.de, +49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs
On 3.12.18 г. 17:20 ч., Josef Bacik wrote: > Now with the delayed_refs_rsv we can now know exactly how much pending > delayed refs space we need. This means we can drastically simplify IMO it will be helpful if there is a sentence here referring back to btrfs_update_delayed_refs_rsv to put your first sentence into context. But I guess this is something David can also do. > btrfs_check_space_for_delayed_refs by simply checking how much space we > have reserved for the global rsv (which acts as a spill over buffer) and > the delayed refs rsv. If our total size is beyond that amount then we > know it's time to commit the transaction and stop any more delayed refs > from being generated. > > Signed-off-by: Josef Bacik Reviewed-by: Nikolay Borisov > --- > fs/btrfs/ctree.h | 2 +- > fs/btrfs/extent-tree.c | 48 ++-- > fs/btrfs/inode.c | 4 ++-- > fs/btrfs/transaction.c | 2 +- > 4 files changed, 22 insertions(+), 34 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 2eba398c722b..30da075c042e 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -2631,7 +2631,7 @@ static inline u64 btrfs_calc_trunc_metadata_size(struct > btrfs_fs_info *fs_info, > } > > int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans); > -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans); > +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info); > void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info, >const u64 start); > void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg); > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 5a2d0b061f57..07ef1b8087f7 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -2839,40 +2839,28 @@ u64 btrfs_csum_bytes_to_leaves(struct btrfs_fs_info > *fs_info, u64 csum_bytes) > return num_csums; > } > > -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans) > +bool 
btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info) > { > - struct btrfs_fs_info *fs_info = trans->fs_info; > - struct btrfs_block_rsv *global_rsv; > - u64 num_heads = trans->transaction->delayed_refs.num_heads_ready; > - u64 csum_bytes = trans->transaction->delayed_refs.pending_csums; > - unsigned int num_dirty_bgs = trans->transaction->num_dirty_bgs; > - u64 num_bytes, num_dirty_bgs_bytes; > - int ret = 0; > + struct btrfs_block_rsv *delayed_refs_rsv = _info->delayed_refs_rsv; > + struct btrfs_block_rsv *global_rsv = _info->global_block_rsv; > + bool ret = false; > + u64 reserved; > > - num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1); > - num_heads = heads_to_leaves(fs_info, num_heads); > - if (num_heads > 1) > - num_bytes += (num_heads - 1) * fs_info->nodesize; > - num_bytes <<= 1; > - num_bytes += btrfs_csum_bytes_to_leaves(fs_info, csum_bytes) * > - fs_info->nodesize; > - num_dirty_bgs_bytes = btrfs_calc_trans_metadata_size(fs_info, > - num_dirty_bgs); > - global_rsv = _info->global_block_rsv; > + spin_lock(_rsv->lock); > + reserved = global_rsv->reserved; > + spin_unlock(_rsv->lock); > > /* > - * If we can't allocate any more chunks lets make sure we have _lots_ of > - * wiggle room since running delayed refs can create more delayed refs. > + * Since the global reserve is just kind of magic we don't really want > + * to rely on it to save our bacon, so if our size is more than the > + * delayed_refs_rsv and the global rsv then it's time to think about > + * bailing. 
>*/ > - if (global_rsv->space_info->full) { > - num_dirty_bgs_bytes <<= 1; > - num_bytes <<= 1; > - } > - > - spin_lock(_rsv->lock); > - if (global_rsv->reserved <= num_bytes + num_dirty_bgs_bytes) > - ret = 1; > - spin_unlock(_rsv->lock); > + spin_lock(_refs_rsv->lock); > + reserved += delayed_refs_rsv->reserved; > + if (delayed_refs_rsv->size >= reserved) > + ret = true; > + spin_unlock(_refs_rsv->lock); > return ret; > } > > @@ -2891,7 +2879,7 @@ int btrfs_should_throttle_delayed_refs(struct > btrfs_trans_handle *trans) > if (val >= NSEC_PER_SEC / 2) > return 2; > > - return btrfs_check_space_for_delayed_refs(trans); > + return btrfs_check_space_for_delayed_refs(trans->fs_info); > } > > struct async_delayed_refs { > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index a097f5fde31d..8532a2eb56d1 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -5326,8 +5326,8 @@ static struct btrfs_trans_handle > *evict_refill_and_join(struct btrfs_root *root, >
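The reworked check above reduces to a single comparison: the global reserve acts as a spill-over buffer on top of the delayed refs rsv. A toy model of that decision (made-up byte counts, none of the real locking or structures):

```python
def check_space_for_delayed_refs(global_reserved, delayed_refs_reserved,
                                 delayed_refs_size):
    """Model of the reworked check: commit only when the delayed refs
    rsv's required size exceeds what both reserves currently hold."""
    reserved = global_reserved + delayed_refs_reserved
    return delayed_refs_size >= reserved   # True -> time to commit

# Plenty of headroom: no need to commit yet.
print(check_space_for_delayed_refs(10_000, 5_000, 8_000))   # False
# Pending delayed refs outgrew both reserves: commit.
print(check_space_for_delayed_refs(1_000, 2_000, 8_000))    # True
```

This replaces the old heuristic (estimating leaves/nodes from the number of pending heads and csums) with a direct comparison against tracked reservations.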
Re: [PATCH 09/10] btrfs: don't run delayed refs in the end transaction logic
On 3.12.18 г. 17:20 ч., Josef Bacik wrote: > Over the years we have built up a lot of infrastructure to keep delayed > refs in check, mostly by running them at btrfs_end_transaction() time. > We have a lot of different maths we do to figure out how much, if we > should do it inline or async, etc. This existed because we had no > feedback mechanism to force the flushing of delayed refs when they > became a problem. However with the enospc flushing infrastructure in > place for flushing delayed refs when they put too much pressure on the > enospc system we have this problem solved. Rip out all of this code as > it is no longer needed. > > Signed-off-by: Josef Bacik > --- > fs/btrfs/transaction.c | 38 -- > 1 file changed, 38 deletions(-) > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c > index 2d8401bf8df9..01f39401619a 100644 > --- a/fs/btrfs/transaction.c > +++ b/fs/btrfs/transaction.c > @@ -798,22 +798,12 @@ static int should_end_transaction(struct > btrfs_trans_handle *trans) > int btrfs_should_end_transaction(struct btrfs_trans_handle *trans) > { > struct btrfs_transaction *cur_trans = trans->transaction; > - int updates; > - int err; > > smp_mb(); > if (cur_trans->state >= TRANS_STATE_BLOCKED || > cur_trans->delayed_refs.flushing) > return 1; > > - updates = trans->delayed_ref_updates; > - trans->delayed_ref_updates = 0; > - if (updates) { > - err = btrfs_run_delayed_refs(trans, updates * 2); > - if (err) /* Error code will also eval true */ > - return err; > - } > - > return should_end_transaction(trans); > } > > @@ -843,11 +833,8 @@ static int __btrfs_end_transaction(struct > btrfs_trans_handle *trans, > { > struct btrfs_fs_info *info = trans->fs_info; > struct btrfs_transaction *cur_trans = trans->transaction; > - u64 transid = trans->transid; > - unsigned long cur = trans->delayed_ref_updates; > int lock = (trans->type != TRANS_JOIN_NOLOCK); > int err = 0; > - int must_run_delayed_refs = 0; > > if (refcount_read(>use_count) > 1) { > 
refcount_dec(>use_count); > @@ -858,27 +845,6 @@ static int __btrfs_end_transaction(struct > btrfs_trans_handle *trans, > btrfs_trans_release_metadata(trans); > trans->block_rsv = NULL; > > - if (!list_empty(>new_bgs)) > - btrfs_create_pending_block_groups(trans); Is this being deleted because in delayed_refs_rsv you also account for new block groups? > - > - trans->delayed_ref_updates = 0; > - if (!trans->sync) { > - must_run_delayed_refs = > - btrfs_should_throttle_delayed_refs(trans); > - cur = max_t(unsigned long, cur, 32); > - > - /* > - * don't make the caller wait if they are from a NOLOCK > - * or ATTACH transaction, it will deadlock with commit > - */ > - if (must_run_delayed_refs == 1 && > - (trans->type & (__TRANS_JOIN_NOLOCK | __TRANS_ATTACH))) > - must_run_delayed_refs = 2; > - } > - > - btrfs_trans_release_metadata(trans); > - trans->block_rsv = NULL; Why remove those 2 lines as well? > - > if (!list_empty(>new_bgs)) > btrfs_create_pending_block_groups(trans); > > @@ -923,10 +889,6 @@ static int __btrfs_end_transaction(struct > btrfs_trans_handle *trans, > } > > kmem_cache_free(btrfs_trans_handle_cachep, trans); > - if (must_run_delayed_refs) { > - btrfs_async_run_delayed_refs(info, cur, transid, > - must_run_delayed_refs == 1); > - } > return err; > } > >
Re: [PATCH 00/10][V2] Delayed refs rsv
On Mon, Dec 03, 2018 at 10:20:28AM -0500, Josef Bacik wrote: > v1->v2: > - addressed the comments from the various reviewers. > - split "introduce delayed_refs_rsv" into 5 patches. The patches are the same > together as they were, just split out more logically. They can't really be > bisected across in that you will likely have fun enospc failures, but they > compile properly. This was done to make it easier for review. Thanks, that was helpful. The bisectability is IMO not hurt, as it compiles and runs despite the potential ENOSPC problems. I guess the point is to be able to reach any commit & compile & run and not be hit by unrelated bugs. If somebody is bisecting an ENOSPC problem and lands in the middle of the series, there's a hint in the changelogs that something is going on. > -- Original message -- > > This patchset changes how we do space reservations for delayed refs. We were > hitting probably 20-40 enospc aborts per day in production while running > delayed refs at transaction commit time. This means we ran out of space in > the > global reserve and couldn't easily get more space in use_block_rsv(). > > The global reserve has grown to cover everything we don't reserve space > explicitly for, and we've grown a lot of weird ad-hoc heuristics to know if > we're running short on space and when it's time to force a commit. A failure > rate of 20-40 file systems when we run hundreds of thousands of them isn't > super > high, but cleaning up this code will make things less ugly and more > predictable. > > Thus the delayed refs rsv. We always know how many delayed refs we have > outstanding, and although running them generates more we can use the global > reserve for that spill over, which fits better into its desired use than a > full > blown reservation. This first approach is to simply take how many times we're > reserving space for and multiply that by 2 in order to save enough space for > the > delayed refs that could be generated. 
This is a naive approach and will > probably evolve, but for now it works. > > With this patchset we've gone down to 2-8 failures per week. It's not > perfect, > there are some corner cases that still need to be addressed, but it is > significantly better than what we had. Thanks, I've merged the series to misc-next, the amount of reviews and testing is sufficient. Though there are some more comments or suggestions for improvements, they can be done as followups. The other patchsets from you are mainly fixes, which can be applied at the -rc time, for that reason I'm glad the main infrastructure change can be merged before the 4.21 code freeze.
Re: [PATCH 0/3] btrfs: use offset_in_page and PAGE_ALIGNED
On Wed, Dec 05, 2018 at 03:23:02PM +0100, Johannes Thumshirn wrote: > Use the offset_in_page() and PAGE_ALIGNED() macros instead of open-coding them > throughout btrfs. > > This series also includes a patch for 'make coccicheck' which is marked as an > RFC and I've CCed Julia in the hoping to get input from her. > > Johannes Thumshirn (3): > btrfs: use offset_in_page instead of open-coding it > btrfs: use PAGE_ALIGNED instead of open-coding it Added to misc-next, thanks.
Re: What if TRIM issued a wipe on devices that don't TRIM?
On 2018-12-06 01:11, Robert White wrote: (1) Automatic and selective wiping of unused and previously used disk blocks is a good security measure, particularly when there is an encryption layer beneath the file system. (2) USB attached devices _never_ support TRIM and they are the most likely to fall into strangers' hands. Not true on the first count. Some really nice UAS devices do support SCSI UNMAP and WRITESAME commands. (3) I vaguely recall that some flash chips will take bulk writes of full sectors of 0x00 or 0xFF (I don't remember which) as second-best to TRIM for letting the flash controllers defragment their internals. So it would be dog-slow, but it would be neat if BTRFS had a mount option to convert any TRIM command from above into the write of a zero, 0xFF, or trash block to the device below if that device doesn't support TRIM. Real TRIM support would override the block write. Obviously doing an fstrim would involve a lot of slow device writes but only for people likely to do that sort of thing. For testing purposes the destruction of unused pages in this manner might catch file system failures or coding errors. (The other layer where this might be most appropriate is in cryptsetup et al, where it could lie about TRIM support, but that sort of stealth lag might be bad for filesystem-level operations. Doing it there would also lose the simpler USB use cases.) ...Just a thought... First off, TRIM is an ATA command, not the kernel term. `fstrim` inherited the ATA name, but in the kernel it's called a discard operation, and it's kind of important to understand here that a discard operation can result in a number of different behaviors. In particular, you have at least the following implementations: * On SCSI devices, a discard operation translates to a SCSI UNMAP command. As pointed out by Ronnie Sahlberg in his reply, this command is purely advisory, may not result in any actual state change on the target device, and is not guaranteed to wipe the data. 
To actually wipe things, you have to explicitly write bogus data to the given regions (using either regular writes, or a WRITESAME command with the desired pattern), and _then_ call UNMAP on them. * On dm-thinp devices, a discard operation results in simply unmapping the blocks in the region it covers. The underlying blocks themselves are not wiped until they get reallocated (which may not happen when you write to that region of the dm-thinp device again), and may not even be wiped then (depending on how the dm-thinp device is configured). Thus, the same behavior as for SCSI is required here. * On SD/MMC devices, a discard operation results in an SD ERASE command being issued. This one is non-advisory (that is, it's guaranteed to happen), and is supposed to guarantee an overwrite of the region with zeroes or ones. * eMMC devices additionally define a discard operation independent of the SD ERASE command which unmaps the region in the translation layer, but does not wipe the blocks either on issuing the command or on re-allocating the low-level blocks. Essentially, it's just a hint for the wear-leveling algorithm. * NVMe provides two different discard operations, and I'm not sure which the kernel uses for NVMe block emulation. They correspond almost exactly to the SCSI UNMAP and SD ERASE commands in terms of behavior. * For ATA devices, a discard operation translates to an ATA TRIM command. This command doesn't even require that the data read back from a region the command has been issued against be consistent between reads, let alone that it actually returns zeroes, and it is completely silent on how the device should actually implement the operation. In practice, most drives that implement it actually behave like dm-thinp devices, unmapping the low-level blocks in the region and only clearing them when they get reallocated, while returning any data they want on subsequent reads to that logical region until a write happens. 
* The MTD subsystem has support for discard operations in the various
  FTLs, and they appear from a cursory look at the code to behave like
  a non-advisory version of the SCSI UNMAP command (FWIW, MTDs are what
  the concept of a discard operation was originally implemented in
  Linux for).

Notice that the only implementations that are actually guaranteed to
clear out the low-level physical blocks are the SD ERASE command and
one of the two NVMe options; all others require you to manually wipe
the data before issuing the discard operation to guarantee that no data
is retained.

Given this, I don't think this should be done as a mechanism of
intercepting or translating discard operations, but as something else
entirely. Perhaps as a block layer that wipes the region and then
issues a discard for it to the lower-level device if the device
supports it?
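Pending such a block layer, the wipe-then-discard sequence can already be approximated from user space with util-linux's blkdiscard, which exposes both an explicit zero-out (the BLKZEROOUT ioctl) and a plain discard. A minimal sketch, assuming root and a disposable test device; the device path, offset and length are placeholders:

```shell
# Sketch: explicitly zero a byte range, then discard it, rather than
# trusting the device's discard implementation to destroy the data.
# Only point this at a disposable test device.
wipe_then_discard() {
    local dev="$1" offset="$2" length="$3"
    # Step 1: overwrite the range with zeroes via the BLKZEROOUT ioctl,
    # so the data is gone regardless of what discard does.
    blkdiscard --zeroout --offset "$offset" --length "$length" "$dev"
    # Step 2: issue the (possibly advisory) discard for the same range,
    # so the flash translation layer can still reclaim the blocks.
    blkdiscard --offset "$offset" --length "$length" "$dev"
}
```

Devices that report secure-discard support can use `blkdiscard --secure` (BLKSECDISCARD) instead of the two-step sequence.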
Re: [PATCH 06/10] btrfs: update may_commit_transaction to use the delayed refs rsv
On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> Any space used in the delayed_refs_rsv will be freed up by a transaction
> commit, so instead of just counting the pinned space we also need to
> account for any space in the delayed_refs_rsv when deciding if it will
> make a difference to commit the transaction to satisfy our space
> reservation. If we have enough bytes to satisfy our reservation ticket
> then we are good to go, otherwise subtract out what space we would gain
> back by committing the transaction and compare that against the pinned
> space to make our decision.
>
> Signed-off-by: Josef Bacik

Reviewed-by: Nikolay Borisov

However, look below for one suggestion:

> ---
>  fs/btrfs/extent-tree.c | 24 +++-
>  1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index aa0a638d0263..63ff9d832867 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4843,8 +4843,10 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>  {
>  	struct reserve_ticket *ticket = NULL;
>  	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_block_rsv;
> +	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
>  	struct btrfs_trans_handle *trans;
> -	u64 bytes;
> +	u64 bytes_needed;
> +	u64 reclaim_bytes = 0;
>
>  	trans = (struct btrfs_trans_handle *)current->journal_info;
>  	if (trans)
> @@ -4857,15 +4859,15 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>  	else if (!list_empty(&space_info->tickets))
>  		ticket = list_first_entry(&space_info->tickets,
>  					  struct reserve_ticket, list);
> -	bytes = (ticket) ? ticket->bytes : 0;
> +	bytes_needed = (ticket) ? ticket->bytes : 0;
>  	spin_unlock(&space_info->lock);
>
> -	if (!bytes)
> +	if (!bytes_needed)
>  		return 0;
>
>  	/* See if there is enough pinned space to make this reservation */
>  	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
> -				     bytes,
> +				     bytes_needed,
>  				     BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
>  		goto commit;
>
> @@ -4877,14 +4879,18 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>  		return -ENOSPC;

If we remove this check:

    if (space_info != delayed_rsv->space_info)
        return -ENOSPC;

can't we move the reclaim_bytes calculation above the
__percpu_counter_compare and eventually be left with just a single
invocation of percpu_compare? The diff should look something along the
lines of:

@@ -4828,19 +4827,6 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	if (!bytes)
 		return 0;

-	/* See if there is enough pinned space to make this reservation */
-	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
-				     bytes,
-				     BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
-		goto commit;
-
-	/*
-	 * See if there is some space in the delayed insertion reservation for
-	 * this reservation.
-	 */
-	if (space_info != delayed_rsv->space_info)
-		return -ENOSPC;
-
 	spin_lock(&delayed_rsv->lock);
 	if (delayed_rsv->size > bytes)
 		bytes = 0;
@@ -4850,9 +4836,8 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,

 	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
 				     bytes,
-				     BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
+				     BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
 		return -ENOSPC;
-	}
 commit:
 	trans = btrfs_join_transaction(fs_info->extent_root);

>
>  	spin_lock(&delayed_rsv->lock);
> -	if (delayed_rsv->size > bytes)
> -		bytes = 0;
> -	else
> -		bytes -= delayed_rsv->size;
> +	reclaim_bytes += delayed_rsv->reserved;
>  	spin_unlock(&delayed_rsv->lock);
>
> +	spin_lock(&delayed_refs_rsv->lock);
> +	reclaim_bytes += delayed_refs_rsv->reserved;
> +	spin_unlock(&delayed_refs_rsv->lock);
> +	if (reclaim_bytes >= bytes_needed)
> +		goto commit;
> +	bytes_needed -= reclaim_bytes;
> +
>  	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
> -				     bytes,
> +				     bytes_needed,
>  				     BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
>  		return -ENOSPC;
>  	}
>
Re: [PATCH 02/10] btrfs: add cleanup_ref_head_accounting helper
On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> From: Josef Bacik
>
> We were missing some quota cleanups in check_ref_cleanup, so break the
> ref head accounting cleanup into a helper and call that from both
> check_ref_cleanup and cleanup_ref_head. This will hopefully ensure that
> we don't screw up accounting in the future for other things that we add.
>
> Reviewed-by: Omar Sandoval
> Reviewed-by: Liu Bo
> Signed-off-by: Josef Bacik

Doesn't this also need a stable tag? Furthermore, doesn't the missing
code dealing with total_bytes_pinned in check_ref_cleanup mean that
every time the last reference for a block was freed we were leaking
bytes in total_bytes_pinned? Shouldn't this have led to
total_bytes_pinned eventually dominating the usage in a space_info?

Codewise lgtm:

Reviewed-by: Nikolay Borisov

> ---
>  fs/btrfs/extent-tree.c | 67 +-
>  1 file changed, 39 insertions(+), 28 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index c36b3a42f2bb..e3ed3507018d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2443,6 +2443,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans,
>  	return ret ? ret : 1;
>  }
>
> +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> +					struct btrfs_delayed_ref_head *head)
> +{
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
> +	struct btrfs_delayed_ref_root *delayed_refs =
> +		&trans->transaction->delayed_refs;
> +
> +	if (head->total_ref_mod < 0) {
> +		struct btrfs_space_info *space_info;
> +		u64 flags;
> +
> +		if (head->is_data)
> +			flags = BTRFS_BLOCK_GROUP_DATA;
> +		else if (head->is_system)
> +			flags = BTRFS_BLOCK_GROUP_SYSTEM;
> +		else
> +			flags = BTRFS_BLOCK_GROUP_METADATA;
> +		space_info = __find_space_info(fs_info, flags);
> +		ASSERT(space_info);
> +		percpu_counter_add_batch(&space_info->total_bytes_pinned,
> +					 -head->num_bytes,
> +					 BTRFS_TOTAL_BYTES_PINNED_BATCH);
> +
> +		if (head->is_data) {
> +			spin_lock(&delayed_refs->lock);
> +			delayed_refs->pending_csums -= head->num_bytes;
> +			spin_unlock(&delayed_refs->lock);
> +		}
> +	}
> +
> +	/* Also free its reserved qgroup space */
> +	btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> +				      head->qgroup_reserved);
> +}
> +
>  static int cleanup_ref_head(struct btrfs_trans_handle *trans,
>  			    struct btrfs_delayed_ref_head *head)
>  {
> @@ -2478,31 +2513,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
>  	spin_unlock(&head->lock);
>  	spin_unlock(&delayed_refs->lock);
>
> -	trace_run_delayed_ref_head(fs_info, head, 0);
> -
> -	if (head->total_ref_mod < 0) {
> -		struct btrfs_space_info *space_info;
> -		u64 flags;
> -
> -		if (head->is_data)
> -			flags = BTRFS_BLOCK_GROUP_DATA;
> -		else if (head->is_system)
> -			flags = BTRFS_BLOCK_GROUP_SYSTEM;
> -		else
> -			flags = BTRFS_BLOCK_GROUP_METADATA;
> -		space_info = __find_space_info(fs_info, flags);
> -		ASSERT(space_info);
> -		percpu_counter_add_batch(&space_info->total_bytes_pinned,
> -					 -head->num_bytes,
> -					 BTRFS_TOTAL_BYTES_PINNED_BATCH);
> -
> -		if (head->is_data) {
> -			spin_lock(&delayed_refs->lock);
> -			delayed_refs->pending_csums -= head->num_bytes;
> -			spin_unlock(&delayed_refs->lock);
> -		}
> -	}
> -
>  	if (head->must_insert_reserved) {
>  		btrfs_pin_extent(fs_info, head->bytenr,
>  				 head->num_bytes, 1);
> @@ -2512,9 +2522,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
>  		}
>  	}
>
> -	/* Also free its reserved qgroup space */
> -	btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> -				      head->qgroup_reserved);
> +	cleanup_ref_head_accounting(trans, head);
> +
> +	trace_run_delayed_ref_head(fs_info, head, 0);
>  	btrfs_delayed_ref_unlock(head);
>  	btrfs_put_delayed_ref_head(head);
>  	return 0;
> @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
>  	if (head->must_insert_reserved)
>  		ret = 1;
>
> +	cleanup_ref_head_accounting(trans, head);
>  	mutex_unlock(&head->mutex);
>  	btrfs_put_delayed_ref_head(head);
>  	return ret;
>
Re: [PATCH 01/10] btrfs: add btrfs_delete_ref_head helper
On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> From: Josef Bacik
>
> We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
> into a helper and cleanup the calling functions.
>
> Signed-off-by: Josef Bacik
> Reviewed-by: Omar Sandoval

Reviewed-by: Nikolay Borisov

> ---
>  fs/btrfs/delayed-ref.c | 14 ++
>  fs/btrfs/delayed-ref.h |  3 ++-
>  fs/btrfs/extent-tree.c | 22 +++---
>  3 files changed, 19 insertions(+), 20 deletions(-)
>
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 9301b3ad9217..b3e4c9fcb664 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>  	return head;
>  }
>
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +			   struct btrfs_delayed_ref_head *head)
> +{
> +	lockdep_assert_held(&delayed_refs->lock);
> +	lockdep_assert_held(&head->lock);
> +
> +	rb_erase_cached(&head->href_node, &delayed_refs->href_root);
> +	RB_CLEAR_NODE(&head->href_node);
> +	atomic_dec(&delayed_refs->num_entries);
> +	delayed_refs->num_heads--;
> +	if (head->processing == 0)
> +		delayed_refs->num_heads_ready--;
> +}
> +
>  /*
>   * Helper to insert the ref_node to the tail or merge with tail.
>   *
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index 8e20c5cb5404..d2af974f68a1 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head)
>  {
>  	mutex_unlock(&head->mutex);
>  }
> -
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +			   struct btrfs_delayed_ref_head *head);
>
>  struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>  		struct btrfs_delayed_ref_root *delayed_refs);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index d242a1174e50..c36b3a42f2bb 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
>  		spin_unlock(&delayed_refs->lock);
>  		return 1;
>  	}
> -	delayed_refs->num_heads--;
> -	rb_erase_cached(&head->href_node, &delayed_refs->href_root);
> -	RB_CLEAR_NODE(&head->href_node);
> +	btrfs_delete_ref_head(delayed_refs, head);
>  	spin_unlock(&head->lock);
>  	spin_unlock(&delayed_refs->lock);
> -	atomic_dec(&delayed_refs->num_entries);
>
>  	trace_run_delayed_ref_head(fs_info, head, 0);
>
> @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
>  	if (!mutex_trylock(&head->mutex))
>  		goto out;
>
> -	/*
> -	 * at this point we have a head with no other entries.  Go
> -	 * ahead and process it.
> -	 */
> -	rb_erase_cached(&head->href_node, &delayed_refs->href_root);
> -	RB_CLEAR_NODE(&head->href_node);
> -	atomic_dec(&delayed_refs->num_entries);
> -
> -	/*
> -	 * we don't take a ref on the node because we're removing it from the
> -	 * tree, so we just steal the ref the tree was holding.
> -	 */
> -	delayed_refs->num_heads--;
> -	if (head->processing == 0)
> -		delayed_refs->num_heads_ready--;
> +	btrfs_delete_ref_head(delayed_refs, head);
>  	head->processing = 0;
> +
>  	spin_unlock(&head->lock);
>  	spin_unlock(&delayed_refs->lock);
>
>
Re: [PATCH 00/10] btrfs: Support for DAX devices
On 11:07 06/12, Johannes Thumshirn wrote:
> On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> > This is a support for DAX in btrfs. I understand there have been
> > previous attempts at it. However, I wanted to make sure copy-on-write
> > (COW) works on dax as well.
> >
> > Before I present this to the FS folks I wanted to run this through the
> > btrfs. Even though I wish, I cannot get it correct the first time
> > around :/.. Here are some questions for which I need suggestions:
>
> Hi Goldwyn,
>
> I've thrown your patches (from your git tree) onto one of my pmem test
> machines with this pmem config:

Thanks. I will check on this. Ordered extents have been a pain to deal
with for me (though mainly because of my incorrect usage).

> mayhem:~/:[0]# ndctl list
> [
>   {
>     "dev":"namespace1.0",
>     "mode":"fsdax",
>     "map":"dev",
>     "size":792721358848,
>     "uuid":"3fd4ab18-5145-4675-85a0-e05e6f9bcee4",
>     "raw_uuid":"49264743-2351-41c5-9db9-38534813df61",
>     "sector_size":512,
>     "blockdev":"pmem1",
>     "numa_node":1
>   },
>   {
>     "dev":"namespace0.0",
>     "mode":"fsdax",
>     "map":"dev",
>     "size":792721358848,
>     "uuid":"dd0aec3c-7721-4621-8898-e50684a371b5",
>     "raw_uuid":"84ff5463-f76e-4ddf-a248-85122541e909",
>     "sector_size":4096,
>     "blockdev":"pmem0",
>     "numa_node":0
>   }
> ]
>
> Unfortunately I hit a btrfs_panic() with btrfs/002.
> export TEST_DEV=/dev/pmem0
> export SCRATCH_DEV=/dev/pmem1
> export MOUNT_OPTIONS="-o dax"
> ./check
> [...]
> [ 178.173113] run fstests btrfs/002 at 2018-12-06 10:55:43 > [ 178.357044] BTRFS info (device pmem0): disk space caching is enabled > [ 178.357047] BTRFS info (device pmem0): has skinny extents > [ 178.360042] BTRFS info (device pmem0): enabling ssd optimizations > [ 178.475918] BTRFS: device fsid ee888255-7f4a-4bf7-af65-e8a6a354aca8 > devid 1 transid 3 /dev/pmem1 > [ 178.505717] BTRFS info (device pmem1): disk space caching is enabled > [ 178.513593] BTRFS info (device pmem1): has skinny extents > [ 178.520384] BTRFS info (device pmem1): flagging fs with big metadata > feature > [ 178.530997] BTRFS info (device pmem1): enabling ssd optimizations > [ 178.538331] BTRFS info (device pmem1): creating UUID tree > [ 178.587200] BTRFS critical (device pmem1): panic in > ordered_data_tree_panic:57: Inconsistency in ordered tree at offset 0 > (errno=-17 Object already exists) > [ 178.603129] [ cut here ] > [ 178.608667] kernel BUG at fs/btrfs/ordered-data.c:57! > [ 178.614333] invalid opcode: [#1] SMP PTI > [ 178.619295] CPU: 87 PID: 8225 Comm: dd Kdump: loaded Tainted: G > E 4.20.0-rc5-default-btrfs-dax #920 > [ 178.630090] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS > SE5C620.86B.0D.01.0010.072020182008 07/20/2018 > [ 178.640626] RIP: 0010:__btrfs_add_ordered_extent+0x325/0x410 [btrfs] > [ 178.647404] Code: 28 4d 89 f1 49 c7 c0 90 9c 57 c0 b9 ef ff ff ff ba > 39 00 00 00 48 c7 c6 10 fe 56 c0 48 8b b8 d8 03 00 00 31 c0 e8 e2 99 06 > 00 <0f> 0b 65 8b 05 d2 e4 b0 3f 89 c0 48 0f a3 05 78 5e cf c2 0f 92 c0 > [ 178.667019] RSP: 0018:a3e3674c7ba8 EFLAGS: 00010096 > [ 178.672684] RAX: 008f RBX: 9770c2ac5748 RCX: > > [ 178.680254] RDX: 97711f9dee80 RSI: 97711f9d6868 RDI: > 97711f9d6868 > [ 178.687831] RBP: 97711d523000 R08: R09: > 065a > [ 178.695411] R10: 03ff R11: 0001 R12: > 97710d66da70 > [ 178.702993] R13: 9770c2ac5600 R14: R15: > 97710d66d9c0 > [ 178.710573] FS: 7fe11ef90700() GS:97711f9c() > knlGS: > [ 178.719122] CS: 0010 DS: ES: CR0: 80050033 > [ 178.725380] 
CR2: 0156a000 CR3: 00eb30dfc006 CR4: > 007606e0 > [ 178.732999] DR0: DR1: DR2: > > [ 178.740574] DR3: DR6: fffe0ff0 DR7: > 0400 > [ 178.748147] PKRU: 5554 > [ 178.751297] Call Trace: > [ 178.754230] btrfs_add_ordered_extent_dio+0x1d/0x30 [btrfs] > [ 178.760269] btrfs_create_dio_extent+0x79/0xe0 [btrfs] > [ 178.765930] btrfs_get_extent_map_write+0x1a9/0x2b0 [btrfs] > [ 178.771959] btrfs_file_dax_write+0x1f8/0x4f0 [btrfs] > [ 178.777508] ? current_time+0x3f/0x70 > [ 178.781672] btrfs_file_write_iter+0x384/0x580 [btrfs] > [ 178.787265] ? pipe_read+0x243/0x2a0 > [ 178.791298] __vfs_write+0xee/0x170 > [ 178.795241] vfs_write+0xad/0x1a0 > [ 178.799008] ? vfs_read+0x111/0x130 > [ 178.802949] ksys_write+0x42/0x90 > [ 178.806712] do_syscall_64+0x5b/0x180 > [ 178.810829] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 178.816334] RIP: 0033:0x7fe11eabb3d0 > [ 178.820364] Code: 73 01 c3 48 8b 0d b8 ea 2b 00 f7 d8 64 89 01 48 83 > c8 ff c3 66 0f 1f 44 00 00 83 3d b9 43 2c 00 00 75 10 b8 01 00 00 00 0f > 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 2e 90 01 00 48 89
Re: [PATCH 07/10] dax: export functions for use with btrfs
On 6:52 05/12, Christoph Hellwig wrote:
> If you want to export these at all they have to be EXPORT_SYMBOL_GPL.

Understood.

> But I'd really like to avoid seeing another duplicate DAX I/O path.
> Please try to adopt the existing iomap-based infrastructure for your
> needs first.

This is not worth it with btrfs. With non-page-aligned I/O on btrfs, we
need to copy the first/last page of the extents for CoW. So we would
end up using the exported functions anyway. Believe me, I have spent
some time getting btrfs iomap-compatible before giving up. The problems
are that btrfs needs to carry a lot of information across iomap_begin
and iomap_end. While the added private variable helps with this, it
also needs hooks in the bio_submit() functions for crc calculations
during direct writes.

--
Goldwyn
HELP unmountable partition after btrfs balance to RAID0
Dear developers of BTRFS,

we have a problem. We wanted to convert a file system to a RAID0 with
two partitions. Unfortunately we had to reboot the server during the
balance operation before it could complete. Now the following happens:
a mount attempt of the array fails with the errors below. btrfs recover
yields roughly 1.6 out of 4 TB. To recover the rest we have tried:

mount:

[18192.357444] BTRFS info (device sdb1): disk space caching is enabled
[18192.357447] BTRFS info (device sdb1): has skinny extents
[18192.370664] BTRFS error (device sdb1): parent transid verify failed on 30523392 wanted 7432 found 7445
[18192.370810] BTRFS error (device sdb1): parent transid verify failed on 30523392 wanted 7432 found 7445
[18192.394745] BTRFS error (device sdb1): open_ctree failed

Mounting with options ro, degraded, clear_cache etc. yields the same
errors.

btrfs rescue zero-log: this operation works; however, the error
persists and the array remains unmountable:

parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
Ignoring transid failure
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
Ignoring transid failure
Clearing log on /dev/sdb1, previous log_root 0, level 0

btrfs rescue chunk-recover fails with following error message:

btrfs check results in:

Opening filesystem to check...
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
Ignoring transid failure
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
Ignoring transid failure
Checking filesystem on /dev/sdb1
UUID: 6c9ed4e1-d63f-46f0-b1e9-608b8fa43bb8
[1/7] checking root items
parent transid verify failed on 30523392 wanted 7432 found 7443
parent transid verify failed on 30523392 wanted 7432 found 7443
parent transid verify failed on 30523392 wanted 7432 found 7443
parent transid verify failed on 30523392 wanted 7432 found 7443
Ignoring transid failure
leaf parent key incorrect 30523392
ERROR: failed to repair root items: Operation not permitted

Any ideas what is going on or how to recover the file system? I would
greatly appreciate your help!

best,
Thomas

uname -a: Linux server2 4.19.5-1-default #1 SMP PREEMPT Tue Nov 27 19:56:09 UTC 2018 (6210279) x86_64 x86_64 x86_64 GNU/Linux
btrfs-progs version 4.19

--
ScienceConsult - DI Thomas Mohr KG
DI Thomas Mohr
Enzianweg 10a
2353 Guntramsdorf
Austria
+43 2236 56793
+43 660 461 1966
http://www.mohrkeg.co.at
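For reference, the usual non-destructive escalation for parent-transid failures like the ones above can be sketched as follows. This is only a sketch, not guaranteed to apply to this particular filesystem; the mount point and destination directory are placeholders, and the destination must have enough free space:

```shell
# Sketch: read-only recovery escalation for parent-transid failures.
# $1 = damaged device (e.g. /dev/sdb1), $2 = destination with free space.
try_recover_btrfs() {
    local dev="$1" dest="$2"
    # 1. Try mounting read-only from an older tree root first; this is
    #    non-destructive and sometimes gets past transid mismatches.
    mount -o ro,usebackuproot "$dev" /mnt || true
    # 2. If that fails, copy out whatever data is reachable without
    #    writing to the damaged filesystem at all.
    btrfs restore -v "$dev" "$dest"
    # 3. Only after the data is safely copied off, consider destructive
    #    repair -- it can make things worse:
    # btrfs check --repair "$dev"
}
```

The ordering matters: `usebackuproot` and `btrfs restore` never modify the damaged device, while `btrfs check --repair` does and is documented as a last resort.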
Re: [PATCH 00/10] btrfs: Support for DAX devices
On 05/12/2018 13:28, Goldwyn Rodrigues wrote: > This is a support for DAX in btrfs. I understand there have been > previous attempts at it. However, I wanted to make sure copy-on-write > (COW) works on dax as well. > > Before I present this to the FS folks I wanted to run this through the > btrfs. Even though I wish, I cannot get it correct the first time > around :/.. Here are some questions for which I need suggestions: Hi Goldwyn, I've thrown your patches (from your git tree) onto one of my pmem test machines with this pmem config: mayhem:~/:[0]# ndctl list [ { "dev":"namespace1.0", "mode":"fsdax", "map":"dev", "size":792721358848, "uuid":"3fd4ab18-5145-4675-85a0-e05e6f9bcee4", "raw_uuid":"49264743-2351-41c5-9db9-38534813df61", "sector_size":512, "blockdev":"pmem1", "numa_node":1 }, { "dev":"namespace0.0", "mode":"fsdax", "map":"dev", "size":792721358848, "uuid":"dd0aec3c-7721-4621-8898-e50684a371b5", "raw_uuid":"84ff5463-f76e-4ddf-a248-85122541e909", "sector_size":4096, "blockdev":"pmem0", "numa_node":0 } ] Unfortunately I hit a btrfs_panic() with btrfs/002. export TEST_DEV=/dev/pmem0 export SCRATCH_DEV=/dev/pmem1 export MOUNT_OPTIONS="-o dax" ./check [...] 
[ 178.173113] run fstests btrfs/002 at 2018-12-06 10:55:43 [ 178.357044] BTRFS info (device pmem0): disk space caching is enabled [ 178.357047] BTRFS info (device pmem0): has skinny extents [ 178.360042] BTRFS info (device pmem0): enabling ssd optimizations [ 178.475918] BTRFS: device fsid ee888255-7f4a-4bf7-af65-e8a6a354aca8 devid 1 transid 3 /dev/pmem1 [ 178.505717] BTRFS info (device pmem1): disk space caching is enabled [ 178.513593] BTRFS info (device pmem1): has skinny extents [ 178.520384] BTRFS info (device pmem1): flagging fs with big metadata feature [ 178.530997] BTRFS info (device pmem1): enabling ssd optimizations [ 178.538331] BTRFS info (device pmem1): creating UUID tree [ 178.587200] BTRFS critical (device pmem1): panic in ordered_data_tree_panic:57: Inconsistency in ordered tree at offset 0 (errno=-17 Object already exists) [ 178.603129] [ cut here ] [ 178.608667] kernel BUG at fs/btrfs/ordered-data.c:57! [ 178.614333] invalid opcode: [#1] SMP PTI [ 178.619295] CPU: 87 PID: 8225 Comm: dd Kdump: loaded Tainted: G E 4.20.0-rc5-default-btrfs-dax #920 [ 178.630090] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS SE5C620.86B.0D.01.0010.072020182008 07/20/2018 [ 178.640626] RIP: 0010:__btrfs_add_ordered_extent+0x325/0x410 [btrfs] [ 178.647404] Code: 28 4d 89 f1 49 c7 c0 90 9c 57 c0 b9 ef ff ff ff ba 39 00 00 00 48 c7 c6 10 fe 56 c0 48 8b b8 d8 03 00 00 31 c0 e8 e2 99 06 00 <0f> 0b 65 8b 05 d2 e4 b0 3f 89 c0 48 0f a3 05 78 5e cf c2 0f 92 c0 [ 178.667019] RSP: 0018:a3e3674c7ba8 EFLAGS: 00010096 [ 178.672684] RAX: 008f RBX: 9770c2ac5748 RCX: [ 178.680254] RDX: 97711f9dee80 RSI: 97711f9d6868 RDI: 97711f9d6868 [ 178.687831] RBP: 97711d523000 R08: R09: 065a [ 178.695411] R10: 03ff R11: 0001 R12: 97710d66da70 [ 178.702993] R13: 9770c2ac5600 R14: R15: 97710d66d9c0 [ 178.710573] FS: 7fe11ef90700() GS:97711f9c() knlGS: [ 178.719122] CS: 0010 DS: ES: CR0: 80050033 [ 178.725380] CR2: 0156a000 CR3: 00eb30dfc006 CR4: 007606e0 [ 178.732999] DR0: DR1: DR2: [ 
178.740574] DR3: DR6: fffe0ff0 DR7: 0400 [ 178.748147] PKRU: 5554 [ 178.751297] Call Trace: [ 178.754230] btrfs_add_ordered_extent_dio+0x1d/0x30 [btrfs] [ 178.760269] btrfs_create_dio_extent+0x79/0xe0 [btrfs] [ 178.765930] btrfs_get_extent_map_write+0x1a9/0x2b0 [btrfs] [ 178.771959] btrfs_file_dax_write+0x1f8/0x4f0 [btrfs] [ 178.777508] ? current_time+0x3f/0x70 [ 178.781672] btrfs_file_write_iter+0x384/0x580 [btrfs] [ 178.787265] ? pipe_read+0x243/0x2a0 [ 178.791298] __vfs_write+0xee/0x170 [ 178.795241] vfs_write+0xad/0x1a0 [ 178.799008] ? vfs_read+0x111/0x130 [ 178.802949] ksys_write+0x42/0x90 [ 178.806712] do_syscall_64+0x5b/0x180 [ 178.810829] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 178.816334] RIP: 0033:0x7fe11eabb3d0 [ 178.820364] Code: 73 01 c3 48 8b 0d b8 ea 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d b9 43 2c 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 2e 90 01 00 48 89 04 24 [ 178.840052] RSP: 002b:7ffec969d978 EFLAGS: 0246 ORIG_RAX: 0001 [ 178.848100] RAX: ffda RBX: RCX: 7fe11eabb3d0 [ 178.855715] RDX: 0400 RSI: 0156a000 RDI: 0001 [ 178.863326] RBP: 0400 R08: 0003 R09: 7fe11ed7a698 [ 178.870928] R10: 10a8b550 R11: 0246
Re: What if TRIM issued a wipe on devices that don't TRIM?
Hi,

I am more of a SCSI guy than ATA so forgive where I am ignorant.

The SCSI equivalent to TRIM is called UNMAP. UNMAP is unfortunately
only a "hint" to the device, so if the device for any reason is busy,
it can just do a NO-OP, leave the data as is and still return status
SUCCESS. That is not good if you want to wipe data for confidentiality
reasons :-)

As UNMAP and TRIM are related, make sure that TRIM actually provides a
guarantee to wipe the data. I do not know ATA well enough to know
whether it does or not.

In SCSI, instead of using UNMAP/TRIM you can use WRITESAME10/16, which
does provide an overwrite/wipe guarantee. Maybe ATA has something
equivalent to WRITESAME10/16.

I just want to say: be careful, sometimes these commands do not provide
a guarantee that they will actually make the data
overwritten/unretrievable.

ronnie sahlberg

On Thu, Dec 6, 2018 at 4:24 PM Robert White wrote:
>
> (1) Automatic and selective wiping of unused and previously used disk
> blocks is a good security measure, particularly when there is an
> encryption layer beneath the file system.
>
> (2) USB attached devices _never_ support TRIM and they are the most
> likely to fall into strangers' hands.
>
> (3) I vaguely recall that some flash chips will take bulk writes of
> full sectors of 0x00 or 0xFF (I don't remember which) as second-best
> to TRIM for letting the flash controllers defragment their internals.
>
> So it would be dog-slow, but it would be neat if BTRFS had a mount
> option to convert any TRIM command from above into the write of a zero,
> 0xFF, or trash block to the device below if that device doesn't support
> TRIM. Real TRIM support would override the block write.
>
> Obviously doing an fstrim would involve a lot of slow device writes but
> only for people likely to do that sort of thing.
>
> For testing purposes the destruction of unused pages in this manner
> might catch file system failures or coding errors.
>
> (The other layer where this might be most appropriate is in cryptsetup
> et al, where it could lie about TRIM support, but that sort of stealth
> lag might be bad for filesystem-level operations. Doing it there would
> also lose the simpler USB use cases.)
>
> ...Just a thought...
>
> --Rob White.
Re: [PATCH 00/10] btrfs: Support for DAX devices
On 12/5/18 9:37 PM, Jeff Mahoney wrote:
> The high level idea that Jan Kara and I came up with in our
> conversation at Labs conf is pretty expensive. We'd need to set a flag
> that pauses new page faults, set the WP bit on affected ranges, do the
> snapshot, commit, clear the flag, and wake up the waiting threads.
> Neither of us had any concrete idea of how well that would perform and
> it still depends on finding a good way to resolve all open mmap ranges
> on a subvolume. Perhaps using the address_space->private_list anchored
> on each root would work.

This is a potentially wild idea, so "grain of salt" and all that. I may
misuse the exact wording.

So the essential problem of DAX is basically the opposite of data
deduplication. Instead of merging two duplicate data regions, you want
to mark regions as at-risk while keeping the original content intact if
there are snapshots in conflict.

So suppose you _require_ data checksums and a data mode of "dup" or
mirror or one of the other fault-tolerant layouts. By definition any
block that gets written with content it didn't have before will now
have a bad checksum. If the inode is flagged for direct IO that's an
indication that the block has been updated.

At this point you really just need to do the opposite of deduplication:
find/recover the original contents and assign (or leave assigned) those
to the old/other snapshots, then compute the new checksum on the
"original block" and assign it to the active subvolume.

So when a region is mapped for direct IO, and its refcount is greater
than one, and you get to a sync or close event, you "recover" the old
contents into a new location and assign those to "all the other users".
Now that original storage region has only one user, so on sync or close
you fix its checksums on the cheap.

Instead of the new data being a small rock sitting over a large rug to
make a lump, the new data is like a rock being slid under the rug to
make a lump.

So the first write to an extent creates a burdensome copy to retain the
old contents, but second and subsequent writes to the same extent only
have the cost of an _eventual_ checksum of the original block list.

Maybe, if the data isn't already duplicated, the write mapping or the
DAX open or the setting of the S_DUP flag could force the file into an
extent block that _is_ duplicated.

The mental leap required is that the new blocks don't need to belong to
the new state being created. The new blocks can be associated to the
snapshots since data copy is idempotent.

The side note is that it only ever matters if the usage count is
greater than one, so at worst taking a snapshot, which is already a
_little_ racy anyway, would/could trigger a semi-lightweight copy of
any S_DAX files:

If S_DAX:
    if checksum invalid:
        copy data as-is and checksum, store in snapshot
    else:
        look for duplicate checksum
        if duplicate found:
            assign that extent to the snapshot
        else:
            if file opened for writing and has any mmaps for write:
                copy extent and assign to new snapshot
            else:
                increment usage count and assign current block to snapshot

Anyway, I only know enough of the internals to be dangerous. Since the
real goal of mmap is speed during actual update, this idea is basically
about amortizing the copy costs into the task of maintaining the
snapshots instead of leaving them in the immediate hands of the
time-critical updater. The flush, unmap, or close by the user, or a
system-wide sync event, are also good points to expense the bookkeeping
time.
[PATCH 4/8] btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_data_ref()
Just like btrfs_add_delayed_tree_ref(), use btrfs_ref to refactor
btrfs_add_delayed_data_ref().

Signed-off-by: Qu Wenruo
---
 fs/btrfs/delayed-ref.c | 19 +--
 fs/btrfs/delayed-ref.h |  8 +++-
 fs/btrfs/extent-tree.c | 24 +++-
 3 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index c42b8ade7b07..09caf1e6fc22 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -800,21 +800,27 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
  * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
  */
 int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
-			       struct btrfs_root *root,
-			       u64 bytenr, u64 num_bytes,
-			       u64 parent, u64 ref_root,
-			       u64 owner, u64 offset, u64 reserved, int action,
-			       int *old_ref_mod, int *new_ref_mod)
+			       struct btrfs_ref *generic_ref,
+			       u64 reserved, int *old_ref_mod,
+			       int *new_ref_mod)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_delayed_data_ref *ref;
 	struct btrfs_delayed_ref_head *head_ref;
 	struct btrfs_delayed_ref_root *delayed_refs;
 	struct btrfs_qgroup_extent_record *record = NULL;
+	int action = generic_ref->action;
 	int qrecord_inserted;
 	int ret;
+	u64 bytenr = generic_ref->bytenr;
+	u64 num_bytes = generic_ref->len;
+	u64 parent = generic_ref->parent;
+	u64 ref_root = generic_ref->data_ref.ref_root;
+	u64 owner = generic_ref->data_ref.ino;
+	u64 offset = generic_ref->data_ref.offset;
 	u8 ref_type;

+	ASSERT(generic_ref && generic_ref->type == BTRFS_REF_DATA && action);
 	ref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
 	if (!ref)
 		return -ENOMEM;
@@ -838,7 +844,8 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
 	}

 	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
-	    is_fstree(ref_root) && is_fstree(root->root_key.objectid)) {
+	    is_fstree(ref_root) && is_fstree(generic_ref->real_root) &&
+	    !generic_ref->skip_qgroup) {
 		record = kzalloc(sizeof(*record), GFP_NOFS);
 		if (!record) {
 			kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index dbe029c4e01b..a8fde33b43fd 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -337,11 +337,9 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
 			       struct btrfs_delayed_extent_op *extent_op,
 			       int *old_ref_mod, int *new_ref_mod);
 int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
-			       struct btrfs_root *root,
-			       u64 bytenr, u64 num_bytes,
-			       u64 parent, u64 ref_root,
-			       u64 owner, u64 offset, u64 reserved, int action,
-			       int *old_ref_mod, int *new_ref_mod);
+			       struct btrfs_ref *generic_ref,
+			       u64 reserved, int *old_ref_mod,
+			       int *new_ref_mod);
 int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info,
 				struct btrfs_trans_handle *trans,
 				u64 bytenr, u64 num_bytes,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ecfa0234863b..fa5dd3dfe2e7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2049,10 +2049,8 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 		ret = btrfs_add_delayed_tree_ref(trans, &generic_ref, NULL,
 						 &old_ref_mod, &new_ref_mod);
 	} else {
-		ret = btrfs_add_delayed_data_ref(trans, root, bytenr,
-						 num_bytes, parent,
-						 root_objectid, owner, offset,
-						 0, BTRFS_ADD_DELAYED_REF,
+		btrfs_init_data_ref(&generic_ref, root_objectid, owner, offset);
+		ret = btrfs_add_delayed_data_ref(trans, &generic_ref, 0,
 						 &old_ref_mod, &new_ref_mod);
 	}
@@ -7114,10 +7112,8 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
 		ret = btrfs_add_delayed_tree_ref(trans, &generic_ref, NULL,
 						 &old_ref_mod, &new_ref_mod);
 	} else {
-		ret = btrfs_add_delayed_data_ref(trans, root, bytenr,
-						 num_bytes, parent,
-
[PATCH 8/8] btrfs: extent-tree: Use btrfs_ref to refactor btrfs_free_extent()
Similar to btrfs_inc_extent_ref(), just use btrfs_ref to replace the long parameter list and the confusing @owner parameter. Signed-off-by: Qu Wenruo --- fs/btrfs/ctree.h | 5 +--- fs/btrfs/extent-tree.c | 53 ++ fs/btrfs/file.c| 23 ++ fs/btrfs/inode.c | 13 +++ fs/btrfs/relocation.c | 26 + 5 files changed, 62 insertions(+), 58 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index db3df5ce6087..9ed55a29993d 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2671,10 +2671,7 @@ int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info, u64 bytenr, u64 num_bytes, u64 flags, int level, int is_data); -int btrfs_free_extent(struct btrfs_trans_handle *trans, - struct btrfs_root *root, - u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid, - u64 owner, u64 offset, bool for_reloc); +int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref); int btrfs_free_reserved_extent(struct btrfs_fs_info *fs_info, u64 start, u64 len, int delalloc); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index ff60091aef6b..8a6a73006dc4 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3255,10 +3255,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans, if (inc) ret = btrfs_inc_extent_ref(trans, _ref); else - ret = btrfs_free_extent(trans, root, bytenr, - num_bytes, parent, ref_root, - key.objectid, key.offset, - for_reloc); + ret = btrfs_free_extent(trans, _ref); if (ret) goto fail; } else { @@ -3272,9 +3269,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans, if (inc) ret = btrfs_inc_extent_ref(trans, _ref); else - ret = btrfs_free_extent(trans, root, bytenr, - num_bytes, parent, ref_root, - level - 1, 0, for_reloc); + ret = btrfs_free_extent(trans, _ref); if (ret) goto fail; } @@ -7073,47 +7068,43 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans, } /* Can return -ENOMEM */ -int btrfs_free_extent(struct btrfs_trans_handle *trans, - struct 
btrfs_root *root, - u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid, - u64 owner, u64 offset, bool for_reloc) +int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref) { - struct btrfs_fs_info *fs_info = root->fs_info; - struct btrfs_ref generic_ref = { 0 }; + struct btrfs_fs_info *fs_info = trans->fs_info; int old_ref_mod, new_ref_mod; int ret; if (btrfs_is_testing(fs_info)) return 0; - btrfs_init_generic_ref(_ref, BTRFS_DROP_DELAYED_REF, bytenr, - num_bytes, root->root_key.objectid, parent); - generic_ref.skip_qgroup = for_reloc; /* * tree log blocks never actually go into the extent allocation * tree, just update pinning info and exit early. */ - if (root_objectid == BTRFS_TREE_LOG_OBJECTID) { - WARN_ON(owner >= BTRFS_FIRST_FREE_OBJECTID); + if ((ref->type == BTRFS_REF_METADATA && +ref->tree_ref.root == BTRFS_TREE_LOG_OBJECTID) || + (ref->type == BTRFS_REF_DATA && +ref->data_ref.ref_root == BTRFS_TREE_LOG_OBJECTID)) { /* unlocks the pinned mutex */ - btrfs_pin_extent(fs_info, bytenr, num_bytes, 1); + btrfs_pin_extent(fs_info, ref->bytenr, ref->len, 1); old_ref_mod = new_ref_mod = 0; ret = 0; - } else if (owner < BTRFS_FIRST_FREE_OBJECTID) { - btrfs_init_tree_ref(_ref, (int)owner, root_objectid); - ret = btrfs_add_delayed_tree_ref(trans, _ref, NULL, + } else if (ref->type == BTRFS_REF_METADATA) { + ret = btrfs_add_delayed_tree_ref(trans, ref, NULL, _ref_mod, _ref_mod); } else { - btrfs_init_data_ref(_ref, root_objectid, owner, offset); - ret = btrfs_add_delayed_data_ref(trans, _ref, 0, + ret
[PATCH 2/8] btrfs: extent-tree: Open-code process_func in __btrfs_mod_ref
The process_func is never a function hook used anywhere else. Open code it to make later delayed ref refactor easier, so we can refactor btrfs_inc_extent_ref() and btrfs_free_extent() in different patches. Signed-off-by: Qu Wenruo --- fs/btrfs/extent-tree.c | 33 ++--- 1 file changed, 18 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index ea2c3d5220f0..ea68d288d761 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3220,10 +3220,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans, int i; int level; int ret = 0; - int (*process_func)(struct btrfs_trans_handle *, - struct btrfs_root *, - u64, u64, u64, u64, u64, u64, bool); - if (btrfs_is_testing(fs_info)) return 0; @@ -3235,11 +3231,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans, if (!test_bit(BTRFS_ROOT_REF_COWS, >state) && level == 0) return 0; - if (inc) - process_func = btrfs_inc_extent_ref; - else - process_func = btrfs_free_extent; - if (full_backref) parent = buf->start; else @@ -3261,17 +3252,29 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans, num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi); key.offset -= btrfs_file_extent_offset(buf, fi); - ret = process_func(trans, root, bytenr, num_bytes, - parent, ref_root, key.objectid, - key.offset, for_reloc); + if (inc) + ret = btrfs_inc_extent_ref(trans, root, bytenr, + num_bytes, parent, ref_root, + key.objectid, key.offset, + for_reloc); + else + ret = btrfs_free_extent(trans, root, bytenr, + num_bytes, parent, ref_root, + key.objectid, key.offset, + for_reloc); if (ret) goto fail; } else { bytenr = btrfs_node_blockptr(buf, i); num_bytes = fs_info->nodesize; - ret = process_func(trans, root, bytenr, num_bytes, - parent, ref_root, level - 1, 0, - for_reloc); + if (inc) + ret = btrfs_inc_extent_ref(trans, root, bytenr, + num_bytes, parent, ref_root, + level - 1, 0, for_reloc); + else + ret = btrfs_free_extent(trans, root, bytenr, + num_bytes, parent, 
ref_root, + level - 1, 0, for_reloc); if (ret) goto fail; } -- 2.19.2
[PATCH 7/8] btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref()
Now we don't need to play the dirty game of reusing @owner for tree block level. Signed-off-by: Qu Wenruo --- fs/btrfs/ctree.h | 6 ++--- fs/btrfs/extent-tree.c | 58 ++ fs/btrfs/file.c| 20 ++- fs/btrfs/inode.c | 10 +--- fs/btrfs/ioctl.c | 17 - fs/btrfs/relocation.c | 44 fs/btrfs/tree-log.c| 12 ++--- 7 files changed, 100 insertions(+), 67 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 6f4b1e605736..db3df5ce6087 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -40,6 +40,7 @@ extern struct kmem_cache *btrfs_bit_radix_cachep; extern struct kmem_cache *btrfs_path_cachep; extern struct kmem_cache *btrfs_free_space_cachep; struct btrfs_ordered_sum; +struct btrfs_ref; #define BTRFS_MAGIC 0x4D5F53665248425FULL /* ascii _BHRfS_M, no null */ @@ -2682,10 +2683,7 @@ int btrfs_free_and_pin_reserved_extent(struct btrfs_fs_info *fs_info, void btrfs_prepare_extent_commit(struct btrfs_fs_info *fs_info); int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans); int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans, -struct btrfs_root *root, -u64 bytenr, u64 num_bytes, u64 parent, -u64 root_objectid, u64 owner, u64 offset, -bool for_reloc); +struct btrfs_ref *generic_ref); int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans); int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans, diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 70c05ca30d9a..ff60091aef6b 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2026,36 +2026,28 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr, /* Can return -ENOMEM */ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans, -struct btrfs_root *root, -u64 bytenr, u64 num_bytes, u64 parent, -u64 root_objectid, u64 owner, u64 offset, -bool for_reloc) +struct btrfs_ref *generic_ref) { - struct btrfs_fs_info *fs_info = root->fs_info; - struct btrfs_ref generic_ref = { 0 }; + struct btrfs_fs_info *fs_info = trans->fs_info; int 
old_ref_mod, new_ref_mod; int ret; - BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID && - root_objectid == BTRFS_TREE_LOG_OBJECTID); + BUG_ON(generic_ref->type == BTRFS_REF_NOT_SET || + !generic_ref->action); + BUG_ON(generic_ref->type == BTRFS_REF_METADATA && + generic_ref->tree_ref.root == BTRFS_TREE_LOG_OBJECTID); - btrfs_init_generic_ref(_ref, BTRFS_ADD_DELAYED_REF, bytenr, - num_bytes, root->root_key.objectid, parent); - generic_ref.skip_qgroup = for_reloc; - if (owner < BTRFS_FIRST_FREE_OBJECTID) { - btrfs_init_tree_ref(_ref, (int)owner, root_objectid); - ret = btrfs_add_delayed_tree_ref(trans, _ref, + if (generic_ref->type == BTRFS_REF_METADATA) + ret = btrfs_add_delayed_tree_ref(trans, generic_ref, NULL, _ref_mod, _ref_mod); - } else { - btrfs_init_data_ref(_ref, root_objectid, owner, offset); - ret = btrfs_add_delayed_data_ref(trans, _ref, 0, + else + ret = btrfs_add_delayed_data_ref(trans, generic_ref, 0, _ref_mod, _ref_mod); - } - btrfs_ref_tree_mod(fs_info, _ref); + btrfs_ref_tree_mod(fs_info, generic_ref); if (ret == 0 && old_ref_mod < 0 && new_ref_mod >= 0) - add_pinned_bytes(fs_info, _ref); + add_pinned_bytes(fs_info, generic_ref); return ret; } @@ -3212,8 +3204,10 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans, u32 nritems; struct btrfs_key key; struct btrfs_file_extent_item *fi; + struct btrfs_ref generic_ref = { 0 }; bool for_reloc = btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC); int i; + int action; int level; int ret = 0; @@ -3231,6 +3225,10 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans, parent = buf->start; else parent = 0; + if (inc) + action = BTRFS_ADD_DELAYED_REF; + else + action = BTRFS_DROP_DELAYED_REF; for (i = 0; i < nritems; i++) { if (level == 0) { @@ -3248,11 +3246,14 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans, num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi); key.offset -= btrfs_file_extent_offset(buf, fi); + btrfs_init_generic_ref(_ref,
[PATCH 5/8] btrfs: ref-verify: Use btrfs_ref to refactor btrfs_ref_tree_mod()
It's a perfect match for btrfs_ref_tree_mod() to use btrfs_ref, as btrfs_ref describes a metadata/data reference update comprehensively. Now we have one less function use confusing owner/level trick. Signed-off-by: Qu Wenruo --- fs/btrfs/extent-tree.c | 27 +++-- fs/btrfs/ref-verify.c | 53 -- fs/btrfs/ref-verify.h | 10 3 files changed, 42 insertions(+), 48 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index fa5dd3dfe2e7..1d812bc2c7fc 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2038,9 +2038,6 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans, BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID && root_objectid == BTRFS_TREE_LOG_OBJECTID); - btrfs_ref_tree_mod(root, bytenr, num_bytes, parent, root_objectid, - owner, offset, BTRFS_ADD_DELAYED_REF); - btrfs_init_generic_ref(_ref, BTRFS_ADD_DELAYED_REF, bytenr, num_bytes, root->root_key.objectid, parent); generic_ref.skip_qgroup = for_reloc; @@ -2054,6 +2051,8 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans, _ref_mod, _ref_mod); } + btrfs_ref_tree_mod(fs_info, _ref); + if (ret == 0 && old_ref_mod < 0 && new_ref_mod >= 0) { bool metadata = owner < BTRFS_FIRST_FREE_OBJECTID; @@ -7025,10 +7024,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans, if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) { int old_ref_mod, new_ref_mod; - btrfs_ref_tree_mod(root, buf->start, buf->len, parent, - root->root_key.objectid, - btrfs_header_level(buf), 0, - BTRFS_DROP_DELAYED_REF); + btrfs_ref_tree_mod(fs_info, _ref); ret = btrfs_add_delayed_tree_ref(trans, _ref, NULL, _ref_mod, _ref_mod); BUG_ON(ret); /* -ENOMEM */ @@ -7089,11 +7085,6 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans, if (btrfs_is_testing(fs_info)) return 0; - if (root_objectid != BTRFS_TREE_LOG_OBJECTID) - btrfs_ref_tree_mod(root, bytenr, num_bytes, parent, - root_objectid, owner, offset, - BTRFS_DROP_DELAYED_REF); - btrfs_init_generic_ref(_ref, BTRFS_DROP_DELAYED_REF, bytenr, 
num_bytes, root->root_key.objectid, parent); generic_ref.skip_qgroup = for_reloc; @@ -7117,6 +7108,9 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans, _ref_mod, _ref_mod); } + if (root_objectid != BTRFS_TREE_LOG_OBJECTID) + btrfs_ref_tree_mod(fs_info, _ref); + if (ret == 0 && old_ref_mod >= 0 && new_ref_mod < 0) { bool metadata = owner < BTRFS_FIRST_FREE_OBJECTID; @@ -8083,14 +8077,11 @@ int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans, BUG_ON(root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID); - btrfs_ref_tree_mod(root, ins->objectid, ins->offset, 0, - root->root_key.objectid, owner, offset, - BTRFS_ADD_DELAYED_EXTENT); - btrfs_init_generic_ref(_ref, BTRFS_ADD_DELAYED_EXTENT, ins->objectid, ins->offset, root->root_key.objectid, 0); btrfs_init_data_ref(_ref, root->root_key.objectid, owner, offset); + btrfs_ref_tree_mod(root->fs_info, _ref); ret = btrfs_add_delayed_data_ref(trans, _ref, ram_bytes, NULL, NULL); return ret; @@ -8338,13 +8329,11 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans, extent_op->is_data = false; extent_op->level = level; - btrfs_ref_tree_mod(root, ins.objectid, ins.offset, parent, - root_objectid, level, 0, - BTRFS_ADD_DELAYED_EXTENT); btrfs_init_generic_ref(_ref, BTRFS_ADD_DELAYED_EXTENT, ins.objectid, ins.offset, root->root_key.objectid, parent); btrfs_init_tree_ref(_ref, level, root_objectid); + btrfs_ref_tree_mod(fs_info, _ref); ret = btrfs_add_delayed_tree_ref(trans, _ref, extent_op, NULL, NULL); if (ret) diff --git a/fs/btrfs/ref-verify.c b/fs/btrfs/ref-verify.c index d69fbfb30aa9..f6c15b612570 100644 --- a/fs/btrfs/ref-verify.c +++ b/fs/btrfs/ref-verify.c @@ -670,36 +670,43 @@ static
[PATCH 3/8] btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_tree_ref()
btrfs_add_delayed_tree_ref() has a longer and longer parameter list, and some caller like btrfs_inc_extent_ref() are using @owner as level for delayed tree ref. Instead of making the parameter list longer and longer, use btrfs_ref to refactor it, so each parameter assignment should be self-explaining without dirty level/owner trick, and provides the basis for later refactor. Signed-off-by: Qu Wenruo --- fs/btrfs/delayed-ref.c | 24 ++--- fs/btrfs/delayed-ref.h | 4 +--- fs/btrfs/extent-tree.c | 48 -- 3 files changed, 44 insertions(+), 32 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 11dd46be4017..c42b8ade7b07 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -710,9 +710,7 @@ static void init_delayed_ref_common(struct btrfs_fs_info *fs_info, * transaction commits. */ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans, - u64 bytenr, u64 num_bytes, u64 parent, - u64 ref_root, int level, bool for_reloc, - int action, + struct btrfs_ref *generic_ref, struct btrfs_delayed_extent_op *extent_op, int *old_ref_mod, int *new_ref_mod) { @@ -722,10 +720,17 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_root *delayed_refs; struct btrfs_qgroup_extent_record *record = NULL; int qrecord_inserted; - bool is_system = (ref_root == BTRFS_CHUNK_TREE_OBJECTID); + bool is_system = (generic_ref->real_root == BTRFS_CHUNK_TREE_OBJECTID); + int action = generic_ref->action; + int level = generic_ref->tree_ref.level; int ret; + u64 bytenr = generic_ref->bytenr; + u64 num_bytes = generic_ref->len; + u64 parent = generic_ref->parent; u8 ref_type; + ASSERT(generic_ref && generic_ref->type == BTRFS_REF_METADATA && + generic_ref->action); BUG_ON(extent_op && extent_op->is_data); ref = kmem_cache_alloc(btrfs_delayed_tree_ref_cachep, GFP_NOFS); if (!ref) @@ -738,7 +743,9 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans, } if (test_bit(BTRFS_FS_QUOTA_ENABLED, _info->flags) && 
- is_fstree(ref_root) && !for_reloc) { + is_fstree(generic_ref->real_root) && + is_fstree(generic_ref->tree_ref.root) && + !generic_ref->skip_qgroup) { record = kzalloc(sizeof(*record), GFP_NOFS); if (!record) { kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref); @@ -753,13 +760,14 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans, ref_type = BTRFS_TREE_BLOCK_REF_KEY; init_delayed_ref_common(fs_info, >node, bytenr, num_bytes, - ref_root, action, ref_type); - ref->root = ref_root; + generic_ref->tree_ref.root, action, ref_type); + ref->root = generic_ref->tree_ref.root; ref->parent = parent; ref->level = level; init_delayed_ref_head(head_ref, record, bytenr, num_bytes, - ref_root, 0, action, false, is_system); + generic_ref->tree_ref.root, 0, action, false, + is_system); head_ref->extent_op = extent_op; delayed_refs = >transaction->delayed_refs; diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h index e36d6b05d85e..dbe029c4e01b 100644 --- a/fs/btrfs/delayed-ref.h +++ b/fs/btrfs/delayed-ref.h @@ -333,9 +333,7 @@ static inline void btrfs_put_delayed_ref_head(struct btrfs_delayed_ref_head *hea } int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans, - u64 bytenr, u64 num_bytes, u64 parent, - u64 ref_root, int level, bool for_reloc, - int action, + struct btrfs_ref *generic_ref, struct btrfs_delayed_extent_op *extent_op, int *old_ref_mod, int *new_ref_mod); int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans, diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index ea68d288d761..ecfa0234863b 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2031,6 +2031,7 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans, bool for_reloc) { struct btrfs_fs_info *fs_info = root->fs_info; + struct btrfs_ref generic_ref = { 0 }; int old_ref_mod, new_ref_mod; int ret; @@ -2040,13 +2041,13 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans, btrfs_ref_tree_mod(root, bytenr, num_bytes, parent, 
root_objectid,
[PATCH 1/8] btrfs: delayed-ref: Introduce better documented delayed ref structures
The current delayed ref interface has several problems: - Longer and longer parameter lists bytenr num_bytes parent ref_root owner offset for_reloc << Only qgroup code cares. - Different interpretation of the same parameter For a data ref, @owner is the inode number owning this extent, while for a tree ref it's the level. They are even in different size ranges: for level we only need 0~8, while for an inode number it's BTRFS_FIRST_FREE_OBJECTID~BTRFS_LAST_FREE_OBJECTID. And @offset doesn't even make sense for a tree ref. Such parameter reuse may look clever, like a hidden union, but it destroys code readability. To solve both problems, we introduce a new structure, btrfs_ref: - Structure instead of long parameter list This makes later expansion easier, and better documented. - Use btrfs_ref::type to distinguish data and tree refs - Use a proper union to store data/tree ref specific structures. - Use separate functions to fill data/tree ref data, with a common generic function to fill the common bytenr/num_bytes members. All parameters will find their place in btrfs_ref, and an extra member, real_root, inspired by the ref-verify code, is newly introduced for later qgroup code, to record which tree triggered this extent modification. This patch doesn't touch any code, but provides the basis for the incoming refactors. Signed-off-by: Qu Wenruo --- fs/btrfs/delayed-ref.h | 109 + 1 file changed, 109 insertions(+) diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h index d8fa12d3f2cc..e36d6b05d85e 100644 --- a/fs/btrfs/delayed-ref.h +++ b/fs/btrfs/delayed-ref.h @@ -176,6 +176,81 @@ struct btrfs_delayed_ref_root { u64 qgroup_to_skip; }; +enum btrfs_ref_type { + BTRFS_REF_NOT_SET = 0, + BTRFS_REF_DATA, + BTRFS_REF_METADATA, + BTRFS_REF_LAST, +}; + +struct btrfs_data_ref { + /* +* For EXTENT_DATA_REF +* +* @ref_root: current owner of the extent. +* may differ from btrfs_ref::real_root. +* @ino:inode number of the owner. +* @offset: *CALCULATED* offset. Not EXTENT_DATA key offset. 
+* +*/ + u64 ref_root; + u64 ino; + u64 offset; +}; + +struct btrfs_tree_ref { + /* Common for all sub types and skinny combination */ + int level; + + /* +* For TREE_BLOCK_REF (skinny metadata, either inline or keyed) +* +* root here may differ from btrfs_ref::real_root. +*/ + u64 root; + + /* For non-skinny metadata, no special member needed */ +}; + +struct btrfs_ref { + enum btrfs_ref_type type; + int action; + + /* +* Use full backref(SHARED_BLOCK_REF or SHARED_DATA_REF) for this +* extent and its children. +* Set for reloc trees. +*/ + unsigned int use_fullback:1; + + /* +* Whether this extent should go through qgroup record. +* Normally false, but for certain case like delayed subtree scan, +* this can hugely reduce qgroup overhead. +*/ + unsigned int skip_qgroup:1; + + /* +* Who owns this reference modification, optional. +* +* One example: +* When creating reloc tree for source fs, it will increase tree block +* ref for children tree blocks. +* In that case, btrfs_ref::real_root = reloc tree, +* while btrfs_ref::tree_ref::root = fs tree. 
+*/ + u64 real_root; + u64 bytenr; + u64 len; + + /* Common @parent for SHARED_DATA_REF/SHARED_BLOCK_REF */ + u64 parent; + union { + struct btrfs_data_ref data_ref; + struct btrfs_tree_ref tree_ref; + }; +}; + extern struct kmem_cache *btrfs_delayed_ref_head_cachep; extern struct kmem_cache *btrfs_delayed_tree_ref_cachep; extern struct kmem_cache *btrfs_delayed_data_ref_cachep; @@ -184,6 +259,40 @@ extern struct kmem_cache *btrfs_delayed_extent_op_cachep; int __init btrfs_delayed_ref_init(void); void __cold btrfs_delayed_ref_exit(void); +static inline void btrfs_init_generic_ref(struct btrfs_ref *generic_ref, + int action, u64 bytenr, u64 len, u64 real_root, + u64 parent) +{ + generic_ref->action = action; + generic_ref->bytenr = bytenr; + generic_ref->len = len; + generic_ref->real_root = real_root; + generic_ref->parent = parent; +} + +static inline void btrfs_init_tree_ref(struct btrfs_ref *generic_ref, + int level, u64 root) +{ + /* If @real_root not set, use @root as fallback */ + if (!generic_ref->real_root) + generic_ref->real_root = root; + generic_ref->tree_ref.level = level; + generic_ref->tree_ref.root = root; + generic_ref->type = BTRFS_REF_METADATA; +} + +static inline void btrfs_init_data_ref(struct btrfs_ref
[PATCH 6/8] btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()
Since add_pinned_bytes() only needs to know if the extent is metadata and if it's a chunk tree extent, btrfs_ref is a perfect match for it, as we don't need various owner/level trick to determine extent type. Signed-off-by: Qu Wenruo --- fs/btrfs/extent-tree.c | 26 ++ 1 file changed, 10 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 1d812bc2c7fc..70c05ca30d9a 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -738,14 +738,15 @@ static struct btrfs_space_info *__find_space_info(struct btrfs_fs_info *info, return NULL; } -static void add_pinned_bytes(struct btrfs_fs_info *fs_info, s64 num_bytes, -bool metadata, u64 root_objectid) +static void add_pinned_bytes(struct btrfs_fs_info *fs_info, +struct btrfs_ref *ref) { struct btrfs_space_info *space_info; + s64 num_bytes = -ref->len; u64 flags; - if (metadata) { - if (root_objectid == BTRFS_CHUNK_TREE_OBJECTID) + if (ref->type == BTRFS_REF_METADATA) { + if (ref->tree_ref.root == BTRFS_CHUNK_TREE_OBJECTID) flags = BTRFS_BLOCK_GROUP_SYSTEM; else flags = BTRFS_BLOCK_GROUP_METADATA; @@ -2053,11 +2054,8 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans, btrfs_ref_tree_mod(fs_info, _ref); - if (ret == 0 && old_ref_mod < 0 && new_ref_mod >= 0) { - bool metadata = owner < BTRFS_FIRST_FREE_OBJECTID; - - add_pinned_bytes(fs_info, -num_bytes, metadata, root_objectid); - } + if (ret == 0 && old_ref_mod < 0 && new_ref_mod >= 0) + add_pinned_bytes(fs_info, _ref); return ret; } @@ -7059,8 +7057,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans, } out: if (pin) - add_pinned_bytes(fs_info, buf->len, true, -root->root_key.objectid); + add_pinned_bytes(fs_info, _ref); if (last_ref) { /* @@ -7111,11 +7108,8 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans, if (root_objectid != BTRFS_TREE_LOG_OBJECTID) btrfs_ref_tree_mod(fs_info, _ref); - if (ret == 0 && old_ref_mod >= 0 && new_ref_mod < 0) { - bool metadata = owner < 
BTRFS_FIRST_FREE_OBJECTID; - - add_pinned_bytes(fs_info, num_bytes, metadata, root_objectid); - } + if (ret == 0 && old_ref_mod >= 0 && new_ref_mod < 0) + add_pinned_bytes(fs_info, _ref); return ret; } -- 2.19.2
[PATCH 0/8] btrfs: Refactor delayed ref parameter list
The current delayed ref interface has several problems:

- Longer and longer parameter lists: bytenr, num_bytes, parent (so far so good), ref_root, owner, offset (I don't feel well now), for_reloc. The last parameter only makes sense for qgroup code, but we need to pass it a long way. This makes later expansion of the parameter list more and more tricky.

- Different interpretation of the same parameter: for a data ref, @owner is the inode number owning this extent, while for a tree ref it's the level. They are even in different size ranges: for level we only need 0~8, while for an inode number it's BTRFS_FIRST_FREE_OBJECTID~BTRFS_LAST_FREE_OBJECTID, so it's still possible to distinguish them, but it's never straightforward to grasp. And @offset doesn't even make sense for a tree ref. Such parameter reuse may look clever, like a hidden union, but it destroys code readability.

This patchset changes the way we pass parameters for delayed refs. Instead of calling the delayed ref interface like:

  ret = btrfs_inc_extent_ref(trans, root, bytenr, num_bytes, parent, ref_root, owner, offset);

or:

  ret = btrfs_inc_extent_ref(trans, root, bytenr, nodesize, parent, level, ref_root, 0);

we now call:

  btrfs_init_generic_ref(&ref, bytenr, num_bytes, root->root_key.objectid, parent);
  btrfs_init_data_ref(&ref, ref_root, owner, offset);
  ret = btrfs_inc_extent_ref(trans, &ref);

or:

  btrfs_init_generic_ref(&ref, bytenr, num_bytes, root->root_key.objectid, parent);
  btrfs_init_tree_ref(&ref, level, ref_root);
  ret = btrfs_inc_extent_ref(trans, &ref);

To determine whether a ref is tree or data, instead of:

  if (owner < BTRFS_FIRST_FREE_OBJECTID) { ... } else { ... }

we do it straightforwardly:

  if (ref->type == BTRFS_REF_METADATA) { ... } else { ... }

And for newer, minor members, we don't need to add a new parameter to btrfs_add_delayed_tree|data_ref() or btrfs_inc_extent_ref(); just assign them after the generic/data/tree init:

  btrfs_init_generic_ref(&ref, bytenr, num_bytes, root->root_key.objectid, parent);
  btrfs_init_data_ref(&ref, ref_root, owner, offset);
  ref.skip_qgroup = true; /* @skip_qgroup defaults to false, so new code doesn't need to care */
  ret = btrfs_inc_extent_ref(trans, &ref);

This should improve code readability and make later code easier to write.

Qu Wenruo (8):
  btrfs: delayed-ref: Introduce better documented delayed ref structures
  btrfs: extent-tree: Open-code process_func in __btrfs_mod_ref
  btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_tree_ref()
  btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_data_ref()
  btrfs: ref-verify: Use btrfs_ref to refactor btrfs_ref_tree_mod()
  btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()
  btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref()
  btrfs: extent-tree: Use btrfs_ref to refactor btrfs_free_extent()

 fs/btrfs/ctree.h | 11 +--
 fs/btrfs/delayed-ref.c | 43 ++---
 fs/btrfs/delayed-ref.h | 121 +++--
 fs/btrfs/extent-tree.c | 195 +++--
 fs/btrfs/file.c | 43 +
 fs/btrfs/inode.c | 23 +++--
 fs/btrfs/ioctl.c | 17 ++--
 fs/btrfs/ref-verify.c | 53 ++-
 fs/btrfs/ref-verify.h | 10 +--
 fs/btrfs/relocation.c | 70 +--
 fs/btrfs/tree-log.c | 12 ++-
 11 files changed, 375 insertions(+), 223 deletions(-) -- 2.19.2
Re: What if TRIM issued a wipe on devices that don't TRIM?
On Thu, 6 Dec 2018 06:11:46 + Robert White wrote: > So it would be dog-slow, but it would be neat if BTRFS had a mount > option to convert any TRIM command from above into the write of a zero, > 0xFF, or trash block to the device below if that device doesn't support > TRIM. Real TRIM support would override the block write. There is such a project: "dm-linear like target which provides discard, but replaces it with write of random data to a discarded region. Thus, discarded data is securely deleted." https://github.com/vt-alt/dm-secdel -- With respect, Roman
What if TRIM issued a wipe on devices that don't TRIM?
(1) Automatic and selective wiping of unused and previously used disk blocks is a good security measure, particularly when there is an encryption layer beneath the file system. (2) USB attached devices _never_ support TRIM and they are the most likely to fall into strangers' hands. (3) I vaguely recall that for some flash chips, bulk writes of full sectors of 0x00 or 0xFF (I don't remember which) were second-best to TRIM for letting the flash controllers defragment their internals. So it would be dog-slow, but it would be neat if BTRFS had a mount option to convert any TRIM command from above into the write of a zero, 0xFF, or trash block to the device below if that device doesn't support TRIM. Real TRIM support would override the block write. Obviously doing an fstrim would involve a lot of slow device writes, but only for people likely to do that sort of thing. For testing purposes the destruction of unused pages in this manner might catch file system failures or coding errors. (The other layer where this might be most appropriate is in cryptsetup et al, where it could lie about TRIM support, but that sort of stealth lag might be bad for filesystem-level operations. Doing it there would also lose the simpler USB use cases.) ...Just a thought... --Rob White.
Re: [PATCH 00/10] btrfs: Support for DAX devices
On 12/5/18 7:28 AM, Goldwyn Rodrigues wrote: This is support for DAX in btrfs. I understand there have been previous attempts at it. However, I wanted to make sure copy-on-write (COW) works on dax as well. Before I present this to the FS folks I wanted to run this through the btrfs list. Even though I wish I could, I cannot get it correct the first time around :/.. Here are some questions for which I need suggestions: Questions: 1. I have been unable to do checksumming for DAX devices. While checksumming can be done for reads and writes, it is a problem when mmap is involved, because the btrfs kernel module does not get back control after mmap() writes. Any ideas are appreciated, or we would have to set nodatasum when dax is enabled. Yep. It has to be nodatasum, at least within the confines of datasum today. DAX mmap writes are essentially in the same situation as direct I/O when another thread modifies the buffer being submitted, except rather than it being a race, it happens every time. An alternative here could be to add the ability to mark a crc as unreliable and then go back and update them once the last DAX mmap reference is dropped on a range. There's no reason to make this a requirement of the initial implementation, though. 2. Currently, a user can continue writing on "old" extents of an mmaped file after a snapshot has been created. How can we enforce writes to be directed to new extents after snapshots have been created? Do we keep a list of all mmap()s, and re-mmap them after a snapshot? It's the second question that's the hard part. As Adam describes later, setting each pfn read-only will ensure page faults cause the remapping. The high level idea that Jan Kara and I came up with in our conversation at Labs conf is pretty expensive. We'd need to set a flag that pauses new page faults, set the WP bit on affected ranges, do the snapshot, commit, clear the flag, and wake up the waiting threads. 
Neither of us had any concrete idea of how well that would perform, and it still depends on finding a good way to resolve all open mmap ranges on a subvolume. Perhaps using the address_space->private_list anchored on each root would work.

-Jeff

> Tested by creating a pmem device in RAM with the "memmap=2G!4G" kernel command line parameter.
>
> [PATCH 01/10] btrfs: create a mount option for dax
> [PATCH 02/10] btrfs: basic dax read
> [PATCH 03/10] btrfs: dax: read zeros from holes
> [PATCH 04/10] Rename __endio_write_update_ordered() to
> [PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
> [PATCH 06/10] btrfs: dax write support
> [PATCH 07/10] dax: export functions for use with btrfs
> [PATCH 08/10] btrfs: dax add read mmap path
> [PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
> [PATCH 10/10] btrfs: dax mmap write
>
>  fs/btrfs/Makefile   |   1
>  fs/btrfs/ctree.h    |  17 ++
>  fs/btrfs/dax.c      | 303 ++--
>  fs/btrfs/file.c     |  29
>  fs/btrfs/inode.c    |  54 +
>  fs/btrfs/ioctl.c    |   5
>  fs/btrfs/super.c    |  15 ++
>  fs/dax.c            |  35 --
>  include/linux/dax.h |  16 ++
>  9 files changed, 430 insertions(+), 45 deletions(-)

-- 
Jeff Mahoney
SUSE Labs
Re: [PATCH 00/10] btrfs: Support for DAX devices
On 12/5/18 8:03 AM, Qu Wenruo wrote:
> On 2018/12/5 8:28 PM, Goldwyn Rodrigues wrote:
>> This is a support for DAX in btrfs. I understand there have been previous attempts at it. However, I wanted to make sure copy-on-write (COW) works on dax as well. Before I present this to the FS folks I wanted to run this through the btrfs. Even though I wish, I cannot get it correct the first time around :/.. Here are some questions for which I need suggestions:
>>
>> Questions:
>> 1. I have been unable to do checksumming for DAX devices. While checksumming can be done for reads and writes, it is a problem when mmap is involved because the btrfs kernel module does not get back control after an mmap() write. Any ideas are appreciated, or we would have to set nodatasum when dax is enabled.
>
> I'm not familiar with DAX, so it's completely possible I'm talking like an idiot.

The general idea is:

1) There is no page cache involved. read() and write() are like direct I/O writes in concept. The user buffer is written directly (via what is essentially a specialized memcpy) to the NVDIMM.

2) For mmap, once the mapping is established and mapped, the file system is not involved. The application writes directly to the memory as it would a normal mmap, except it's persistent. All that's required to ensure persistence is a CPU cache flush. The only way the file system is involved again is if some operation has occurred to reset the WP bit.

> If btrfs_page_mkwrite() can't provide enough control, then I have a crazy idea.

It can't, because it is only invoked on the page fault path and we want to try to limit those as much as possible.

> Forcing page fault for every mmap() read/write (completely disable page cache like DIO). So that we could get some control since we're informed to read the page and do some hacks there.

There's no way to force a page fault for every mmap read/write. Even if there was, we wouldn't want that.
No user would turn that on when they can just make similar guarantees in their app (which are typically apps that do this already) and not pay any performance penalty. The idea with DAX mmap is that the file system manages the namespace, space allocation, and permissions. Otherwise we stay out of the way. -Jeff -- Jeff Mahoney SUSE Labs
Re: [PATCH][v2] btrfs: run delayed items before dropping the snapshot
On Wed, Dec 5, 2018 at 5:14 PM Josef Bacik wrote: > > From: Josef Bacik > > With my delayed refs patches in place we started seeing a large amount > of aborts in __btrfs_free_extent > > BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root > 35964 owner 1 offset 0 > Call Trace: > ? btrfs_merge_delayed_refs+0xaf/0x340 > __btrfs_run_delayed_refs+0x6ea/0xfc0 > ? btrfs_set_path_blocking+0x31/0x60 > btrfs_run_delayed_refs+0xeb/0x180 > btrfs_commit_transaction+0x179/0x7f0 > ? btrfs_check_space_for_delayed_refs+0x30/0x50 > ? should_end_transaction.isra.19+0xe/0x40 > btrfs_drop_snapshot+0x41c/0x7c0 > btrfs_clean_one_deleted_snapshot+0xb5/0xd0 > cleaner_kthread+0xf6/0x120 > kthread+0xf8/0x130 > ? btree_invalidatepage+0x90/0x90 > ? kthread_bind+0x10/0x10 > ret_from_fork+0x35/0x40 > > This was because btrfs_drop_snapshot depends on the root not being modified > while it's dropping the snapshot. It will unlock the root node (and really > every node) as it walks down the tree, only to re-lock it when it needs to do > something. This is a problem because if we modify the tree we could cow a > block > in our path, which free's our reference to that block. Then once we get back > to > that shared block we'll free our reference to it again, and get ENOENT when > trying to lookup our extent reference to that block in __btrfs_free_extent. > > This is ultimately happening because we have delayed items left to be > processed > for our deleted snapshot _after_ all of the inodes are closed for the > snapshot. > We only run the delayed inode item if we're deleting the inode, and even then > we > do not run the delayed insertions or delayed removals. These can be run at > any > point after our final inode does it's last iput, which is what triggers the > snapshot deletion. We can end up with the snapshot deletion happening and > then > have the delayed items run on that file system, resulting in the above > problem. 
> > This problem has existed forever, however my patches made it much easier to > hit > as I wake up the cleaner much more often to deal with delayed iputs, which > made > us more likely to start the snapshot dropping work before the transaction > commits, which is when the delayed items would generally be run. Before, > generally speaking, we would run the delayed items, commit the transaction, > and > wakeup the cleaner thread to start deleting snapshots, which means we were > less > likely to hit this problem. You could still hit it if you had multiple > snapshots to be deleted and ended up with lots of delayed items, but it was > definitely harder. > > Fix for now by simply running all the delayed items before starting to drop > the > snapshot. We could make this smarter in the future by making the delayed > items > per-root, and then simply drop any delayed items for roots that we are going > to > delete. But for now just a quick and easy solution is the safest. > > Cc: sta...@vger.kernel.org > Signed-off-by: Josef Bacik Reviewed-by: Filipe Manana Looks good now. Thanks. > --- > v1->v2: > - check for errors from btrfs_run_delayed_items. > - Dave I can reroll the series, but the second version of patch 1 is the same, > let me know what you want. > > fs/btrfs/extent-tree.c | 4 > 1 file changed, 4 insertions(+) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index dcb699dd57f3..473084eb7a2d 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -9330,6 +9330,10 @@ int btrfs_drop_snapshot(struct btrfs_root *root, > goto out_free; > } > > + err = btrfs_run_delayed_items(trans); > + if (err) > + goto out_end_trans; > + > if (block_rsv) > trans->block_rsv = block_rsv; > > -- > 2.14.3 > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”
Re: btrfs progs always assume devid 1?
On 2018-12-05 14:50, Roman Mamedov wrote:
> Hello, To migrate my FS to a different physical disk, I have added a new empty device to the FS, then ran the remove operation on the original one. Now my FS has only devid 2:
>
> Label: 'p1' uuid: d886c190-b383-45ba-9272-9f00c6a10c50
> 	Total devices 1 FS bytes used 36.63GiB
> 	devid 2 size 50.00GiB used 45.06GiB path /dev/mapper/vg-p1
>
> And all the operations of btrfs-progs now fail to work in their default invocation, such as:
>
> # btrfs fi resize max .
> Resize '.' of 'max'
> ERROR: unable to resize '.': No such device
> [768813.414821] BTRFS info (device dm-5): resizer unable to find device 1
>
> Of course this works:
>
> # btrfs fi resize 2:max .
> Resize '.' of '2:max'
>
> But this is inconvenient and seems to be a rather simple oversight. If what I got is normal (the device staying as ID 2 after such an operation), then count that as a suggestion that btrfs-progs should use the first existing devid, rather than always looking for a hard-coded devid 1.

I've been meaning to try and write up a patch to special-case this for a while now, but have not gotten around to it yet.

FWIW, this is one of multiple reasons that it's highly recommended to use `btrfs replace` instead of adding a new device and deleting the old one when replacing a device. Other benefits include:

* It doesn't have to run in the foreground (and doesn't by default).
* It usually takes less time.
* Replace operations can be queried while running to get a nice indication of the completion percentage.

The only disadvantage is that the new device has to be at least as large as the old one (though you can get around this to a limited degree by shrinking the old device), and it needs the old and new device to be plugged in at the same time (add/delete doesn't, if you flip the order of the add and delete commands).
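For reference, the `btrfs replace` workflow recommended above looks roughly like this. The device paths, mount point, and devid are placeholders for illustration, and the commands need root and a mounted btrfs file system:

```shell
# Kick off a device replacement; it runs in the background by default.
btrfs replace start /dev/old_disk /dev/new_disk /mnt

# Query progress at any time (prints a completion percentage).
btrfs replace status /mnt

# If the new device is larger, grow the FS afterwards, addressing it by
# its devid (shown by `btrfs filesystem show /mnt`).
btrfs filesystem resize 1:max /mnt
```

Because replace reuses the old device's devid, it also sidesteps the "only devid 2 left" situation that triggers the hard-coded-devid-1 failures described in the report.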
btrfs progs always assume devid 1?
Hello,

To migrate my FS to a different physical disk, I have added a new empty device to the FS, then ran the remove operation on the original one. Now my FS has only devid 2:

Label: 'p1' uuid: d886c190-b383-45ba-9272-9f00c6a10c50
	Total devices 1 FS bytes used 36.63GiB
	devid 2 size 50.00GiB used 45.06GiB path /dev/mapper/vg-p1

And all the operations of btrfs-progs now fail to work in their default invocation, such as:

# btrfs fi resize max .
Resize '.' of 'max'
ERROR: unable to resize '.': No such device
[768813.414821] BTRFS info (device dm-5): resizer unable to find device 1

Of course this works:

# btrfs fi resize 2:max .
Resize '.' of '2:max'

But this is inconvenient and seems to be a rather simple oversight. If what I got is normal (the device staying as ID 2 after such an operation), then count that as a suggestion that btrfs-progs should use the first existing devid, rather than always looking for a hard-coded devid 1.

-- 
With respect,
Roman
Re: Linux-next regression?
On 5 Dec 2018, at 5:59, Andrea Gelmini wrote: > On Tue, Dec 04, 2018 at 10:29:49PM +, Chris Mason wrote: >> I think (hope) this is: >> >> https://bugzilla.kernel.org/show_bug.cgi?id=201685 >> >> Which was just nailed down to a blkmq bug. It triggers when you have >> scsi devices using elevator=none over blkmq. > > Thanks a lot Chris. Really. > Good news: I confirm I recompiled and used blkmq and no-op (at that > time). > Also, the massive write of btrfs defrag can explain the massive > trigger of > the bug, and next corruption. Sorry this happened, but glad you were able to confirm that it explains the trouble you hit. Thanks for the report, I did end up using this as a datapoint to convince myself the bugzilla above wasn't ext4 specific. -chris
[PATCH][v2] btrfs: run delayed items before dropping the snapshot
From: Josef Bacik

With my delayed refs patches in place we started seeing a large amount of aborts in __btrfs_free_extent:

BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964 owner 1 offset 0
Call Trace:
 ? btrfs_merge_delayed_refs+0xaf/0x340
 __btrfs_run_delayed_refs+0x6ea/0xfc0
 ? btrfs_set_path_blocking+0x31/0x60
 btrfs_run_delayed_refs+0xeb/0x180
 btrfs_commit_transaction+0x179/0x7f0
 ? btrfs_check_space_for_delayed_refs+0x30/0x50
 ? should_end_transaction.isra.19+0xe/0x40
 btrfs_drop_snapshot+0x41c/0x7c0
 btrfs_clean_one_deleted_snapshot+0xb5/0xd0
 cleaner_kthread+0xf6/0x120
 kthread+0xf8/0x130
 ? btree_invalidatepage+0x90/0x90
 ? kthread_bind+0x10/0x10
 ret_from_fork+0x35/0x40

This was because btrfs_drop_snapshot depends on the root not being modified while it's dropping the snapshot. It will unlock the root node (and really every node) as it walks down the tree, only to re-lock it when it needs to do something. This is a problem because if we modify the tree we could cow a block in our path, which frees our reference to that block. Then once we get back to that shared block we'll free our reference to it again, and get ENOENT when trying to look up our extent reference to that block in __btrfs_free_extent.

This is ultimately happening because we have delayed items left to be processed for our deleted snapshot _after_ all of the inodes are closed for the snapshot. We only run the delayed inode item if we're deleting the inode, and even then we do not run the delayed insertions or delayed removals. These can be run at any point after our final inode does its last iput, which is what triggers the snapshot deletion. We can end up with the snapshot deletion happening and then have the delayed items run on that file system, resulting in the above problem.
This problem has existed forever, however my patches made it much easier to hit as I wake up the cleaner much more often to deal with delayed iputs, which made us more likely to start the snapshot dropping work before the transaction commits, which is when the delayed items would generally be run. Before, generally speaking, we would run the delayed items, commit the transaction, and wake up the cleaner thread to start deleting snapshots, which means we were less likely to hit this problem. You could still hit it if you had multiple snapshots to be deleted and ended up with lots of delayed items, but it was definitely harder.

Fix for now by simply running all the delayed items before starting to drop the snapshot. We could make this smarter in the future by making the delayed items per-root, and then simply drop any delayed items for roots that we are going to delete. But for now just a quick and easy solution is the safest.

Cc: sta...@vger.kernel.org
Signed-off-by: Josef Bacik
---
v1->v2:
- check for errors from btrfs_run_delayed_items.
- Dave I can reroll the series, but the second version of patch 1 is the same, let me know what you want.

 fs/btrfs/extent-tree.c | 4
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dcb699dd57f3..473084eb7a2d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9330,6 +9330,10 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 		goto out_free;
 	}
 
+	err = btrfs_run_delayed_items(trans);
+	if (err)
+		goto out_end_trans;
+
 	if (block_rsv)
 		trans->block_rsv = block_rsv;
-- 
2.14.3
Re: [RFC PATCH] btrfs: Remove __extent_readpages
On Mon, Dec 03, 2018 at 12:25:32PM +0200, Nikolay Borisov wrote:
> When extent_readpages is called from the generic readahead code it first builds a batch of 16 pages (which might or might not be consecutive, depending on whether add_to_page_cache_lru failed) and submits them to __extent_readpages. The latter ensures that the range of pages (in the batch of 16) that is passed to __do_contiguous_readpages is consecutive.
>
> If add_to_page_cache_lru doesn't fail then __extent_readpages will call __do_contiguous_readpages only once with the whole batch of 16. Alternatively, if add_to_page_cache_lru fails once on the 8th page (as an example) then the contiguous page read code will be called twice.
>
> All of this can be simplified by exploiting the fact that all pages passed to extent_readpages are consecutive, thus when the batch is built in that function it is already consecutive (barring add_to_page_cache_lru failures) so it is ready to be submitted directly to __do_contiguous_readpages. Also simplify the name of the function to contiguous_readpages.
>
> Signed-off-by: Nikolay Borisov
> ---
>
> So this patch looks like a very nice cleanup, however when doing performance measurements with fio I was shocked to see that it actually is detrimental to performance. Here are the results:
>
> The command line used for fio:
> fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k --numjobs=1 --size=1G --runtime=600 --group_reporting --loop 10
>
> This was tested on a vm with 4g of ram so the size of the test is smaller than the memory, so pages should have been nicely readahead.

What this patch changes is now we aren't reading all of the pages we are passed by the readahead, so now we fall back to per-page reading when we go to read those pages because the readahead window has already moved past them.
This isn't great behavior; what we have is nice in that it tries to group the entire range together as much as possible. What your patch changes is that as soon as we stop having a contiguous range we just bail and submit what we have.

Testing it in isolation is likely to show very little change, but having recently touched the fault-in code, where we definitely do not count major faults in some cases, I'd suspect that we're not reflecting this higher fault rate in the performance counters properly. We should preserve the existing behavior; what hurts a little bit on a lightly loaded box is going to hurt a whole lot more on a heavily loaded box.

Thanks,
Josef
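The batching behaviour under discussion is easy to model: __extent_readpages walks the batch of up to 16 pages and carves it into maximal consecutive runs, each of which goes to __do_contiguous_readpages. A toy model of just that grouping step (plain Python with page indices standing in for struct page, not kernel code):

```python
def contiguous_runs(page_indices):
    """Split a batch of page indices into maximal consecutive runs,
    mimicking how __extent_readpages groups pages before handing each
    run to __do_contiguous_readpages."""
    runs = []
    for idx in page_indices:
        if runs and idx == runs[-1][-1] + 1:
            runs[-1].append(idx)   # extends the current contiguous run
        else:
            runs.append([idx])     # gap: start a new run
    return runs


# A batch where add_to_page_cache_lru "failed" on index 3, so the read
# code is invoked twice -- once per run:
print(contiguous_runs([0, 1, 2, 4, 5, 6]))
```

In the proposed patch the second run would instead be submitted as soon as the gap is seen, which is where the per-page fallback behaviour described above comes from.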
Re: [PATCHv3] btrfs: Fix error handling in btrfs_cleanup_ordered_extents
On Wed, Nov 21, 2018 at 05:10:52PM +0200, Nikolay Borisov wrote:
> Running btrfs/124 in a loop hung up on me sporadically with the following call trace:
>
> btrfs D0 5760 5324 0x
> Call Trace:
>  ? __schedule+0x243/0x800
>  schedule+0x33/0x90
>  btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
>  ? wait_woken+0xa0/0xa0
>  btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
>  btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
>  btrfs_relocate_chunk+0x49/0x100 [btrfs]
>  btrfs_balance+0xbeb/0x1740 [btrfs]
>  btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
>  btrfs_ioctl+0x1691/0x3110 [btrfs]
>  ? lockdep_hardirqs_on+0xed/0x180
>  ? __handle_mm_fault+0x8e7/0xfb0
>  ? _raw_spin_unlock+0x24/0x30
>  ? __handle_mm_fault+0x8e7/0xfb0
>  ? do_vfs_ioctl+0xa5/0x6e0
>  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
>  do_vfs_ioctl+0xa5/0x6e0
>  ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
>  ksys_ioctl+0x3a/0x70
>  __x64_sys_ioctl+0x16/0x20
>  do_syscall_64+0x60/0x1b0
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> This happens because during page writeback it's valid for writepage_delalloc to instantiate a delalloc range which doesn't belong to the page currently being written back.
>
> The reason this case is valid is due to find_lock_delalloc_range returning any available range after the passed delalloc_start and ignoring whether the page under writeback is within that range. In turn, ordered extents (OE) are always created for the returned range from find_lock_delalloc_range. If, however, a failure occurs while the OE are being created, then the cleanup code in btrfs_cleanup_ordered_extents will be called.
>
> Unfortunately the code in btrfs_cleanup_ordered_extents doesn't consider the case of such a 'foreign' range being processed and instead it always assumes that the range the OE are created for belongs to the page. This leads to the first page of such a foreign range not being cleaned up, since it's deliberately skipped by the current cleanup code.
> Fix this by correctly checking whether the current page belongs to the range being instantiated and if so adjust the range parameters passed for cleaning up. If it doesn't, then just clean the whole OE range directly.
>
> Signed-off-by: Nikolay Borisov
> Reviewed-by: Josef Bacik

Added to misc-next, thanks.
Re: [PATCH 01/10] btrfs: create a mount option for dax
On Wed, Dec 05, 2018 at 02:43:03PM +0200, Nikolay Borisov wrote: > One question below though . > > > +++ b/fs/btrfs/super.c > > @@ -739,6 +741,17 @@ int btrfs_parse_options(struct btrfs_fs_info *info, > > char *options, > > case Opt_user_subvol_rm_allowed: > > btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED); > > break; > > +#ifdef CONFIG_FS_DAX > > + case Opt_dax: > > + if (btrfs_super_num_devices(info->super_copy) > 1) { > > + btrfs_info(info, > > + "dax not supported for multi-device > > btrfs partition\n"); > > What prevents supporting dax for multiple devices so long as all devices > are dax? As I mentioned in a separate mail, most profiles are either redundant (RAID0), require hardware support (RAID1, DUP) or are impossible (RAID5, RAID6). But, "single" profile multi-device would be useful and actually provide something other dax-supporting filesystems don't have: combining multiple devices into one logical piece. On the other hand, DUP profiles need to be banned. In particular, the filesystem you mount might have existing DUP block groups. Meow! -- ⢀⣴⠾⠻⢶⣦⠀ ⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in ⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned ⠈⠳⣄ to the city of his birth to die.
Re: [PATCH 07/10] dax: export functions for use with btrfs
If you want to export these at all they have to be EXPORT_SYMBOL_GPL. But I'd really like to avoid seeing another duplicate DAX I/O path. Please try to adopt the existing iomap-based infrastructure for your needs first.
Re: [RFC PATCH 3/3] coccinelle: api: add offset_in_page.cocci
On Wed, 5 Dec 2018, Johannes Thumshirn wrote: > Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can be > replaced by the offset_in_page() macro instead of open-coding it. > > Add a coccinelle semantic patch to ease detection and conversion of these. > > This unfortunately doesn't account for the case when we want PAGE_ALIGNED() > instead of offset_in_page() yet. > > Cc: Julia Lawall > Signed-off-by: Johannes Thumshirn > --- > scripts/coccinelle/api/offset_in_page.cocci | 81 > + > 1 file changed, 81 insertions(+) > create mode 100644 scripts/coccinelle/api/offset_in_page.cocci > > diff --git a/scripts/coccinelle/api/offset_in_page.cocci > b/scripts/coccinelle/api/offset_in_page.cocci > new file mode 100644 > index ..ea5b3a8e0390 > --- /dev/null > +++ b/scripts/coccinelle/api/offset_in_page.cocci > @@ -0,0 +1,81 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/// > +/// Use offset_in_page macro on address instead of explicit computation. > +/// > +// Confidence: High > +// Keywords: offset_in_page > +// Comment: Based on vma_pages.cocci > + > +virtual context > +virtual patch > +virtual org > +virtual report > + > + > +//-- > +// For context mode > +//-- > + > +@r_context depends on context && !patch && !org && !report@ > +expression E; > +@@ > + > +( > +* E & ~PAGE_MASK > +| > +* E & (PAGE_SIZE - 1) > +) > + > + > +//-- > +// For patch mode > +//-- > + > +@r_patch depends on !context && patch && !org && !report@ > +expression E; > +type T; > +@@ > + > +( > +- E & ~PAGE_MASK > ++ offset_in_page(E) > +| > +- E & (PAGE_SIZE - 1) > ++ offset_in_page(E) The two lines above should be subsumed by the two lines below. When there is a type metavariable that has no other dependencies, an isomorphism will consider that it is either present or absent. Why not include the cast case for the context and org cases? Masahiro will ultimately commit this. I have added him to CC. Thanks for the contribution. 
julia > +| > +- E & ((T)PAGE_SIZE - 1) > ++ offset_in_page(E) > +) > + > +//-- > +// For org mode > +//-- > + > +@r_org depends on !context && !patch && (org || report)@ > +expression E; > +position p; > +@@ > + > + ( > + * E@p & ~PAGE_MASK > + | > + * E@p & (PAGE_SIZE - 1) > + ) > + > +@script:python depends on report@ > +p << r_org.p; > +x << r_org.E; > +@@ > + > +msg="WARNING: Consider using offset_in_page helper on %s" % (x) > +coccilib.report.print_report(p[0], msg) > + > +@script:python depends on org@ > +p << r_org.p; > +x << r_org.E; > +@@ > + > +msg="WARNING: Consider using offset_in_page helper on %s" % (x) > +msg_safe=msg.replace("[","@(").replace("]",")") > +coccilib.org.print_todo(p[0], msg_safe) > + > -- > 2.16.4 > >
Fix for https://bugzilla.kernel.org/show_bug.cgi?id=200085
Hello,

Is adding a max retry count the correct way to fix this issue?
https://github.com/kdave/btrfs-progs/pull/151/files

I have added a max retry count. Is this the correct way to fix it?

Thanks,
AK
Re: [PATCH 1/3] btrfs: use offset_in_page instead of open-coding it
On 5.12.18 г. 16:23 ч., Johannes Thumshirn wrote: > Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an > offset into a page. > > So replace them by the offset_in_page() macro instead of open-coding it if > they're not used as an alignment check. > > Signed-off-by: Johannes Thumshirn Reviewed-by: Nikolay Borisov > --- > fs/btrfs/check-integrity.c | 12 +-- > fs/btrfs/compression.c | 2 +- > fs/btrfs/extent_io.c | 53 > +- > fs/btrfs/file.c| 4 ++-- > fs/btrfs/inode.c | 7 +++--- > fs/btrfs/send.c| 2 +- > fs/btrfs/volumes.c | 2 +- > 7 files changed, 38 insertions(+), 44 deletions(-) > > diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c > index 781cae168d2a..d319c3020c09 100644 > --- a/fs/btrfs/check-integrity.c > +++ b/fs/btrfs/check-integrity.c > @@ -1202,24 +1202,24 @@ static void btrfsic_read_from_block_data( > void *dstv, u32 offset, size_t len) > { > size_t cur; > - size_t offset_in_page; > + size_t pgoff; > char *kaddr; > char *dst = (char *)dstv; > - size_t start_offset = block_ctx->start & ((u64)PAGE_SIZE - 1); > + size_t start_offset = offset_in_page(block_ctx->start); > unsigned long i = (start_offset + offset) >> PAGE_SHIFT; > > WARN_ON(offset + len > block_ctx->len); > - offset_in_page = (start_offset + offset) & (PAGE_SIZE - 1); > + pgoff = offset_in_page(start_offset + offset); > > while (len > 0) { > - cur = min(len, ((size_t)PAGE_SIZE - offset_in_page)); > + cur = min(len, ((size_t)PAGE_SIZE - pgoff)); > BUG_ON(i >= DIV_ROUND_UP(block_ctx->len, PAGE_SIZE)); > kaddr = block_ctx->datav[i]; > - memcpy(dst, kaddr + offset_in_page, cur); > + memcpy(dst, kaddr + pgoff, cur); > > dst += cur; > len -= cur; > - offset_in_page = 0; > + pgoff = 0; > i++; > } > } > diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c > index dba59ae914b8..2ab5591449f2 100644 > --- a/fs/btrfs/compression.c > +++ b/fs/btrfs/compression.c > @@ -477,7 +477,7 @@ static noinline int add_ra_bio_pages(struct inode *inode, > > if (page->index 
== end_index) { > char *userpage; > - size_t zero_offset = isize & (PAGE_SIZE - 1); > + size_t zero_offset = offset_in_page(isize); > > if (zero_offset) { > int zeros; > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index b2769e92b556..e365c5272e6b 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -2585,7 +2585,7 @@ static void end_bio_extent_readpage(struct bio *bio) > unsigned off; > > /* Zero out the end if this page straddles i_size */ > - off = i_size & (PAGE_SIZE-1); > + off = offset_in_page(i_size); > if (page->index == end_index && off) > zero_user_segment(page, off, PAGE_SIZE); > SetPageUptodate(page); > @@ -2888,7 +2888,7 @@ static int __do_readpage(struct extent_io_tree *tree, > > if (page->index == last_byte >> PAGE_SHIFT) { > char *userpage; > - size_t zero_offset = last_byte & (PAGE_SIZE - 1); > + size_t zero_offset = offset_in_page(last_byte); > > if (zero_offset) { > iosize = PAGE_SIZE - zero_offset; > @@ -3432,7 +3432,7 @@ static int __extent_writepage(struct page *page, struct > writeback_control *wbc, > > ClearPageError(page); > > - pg_offset = i_size & (PAGE_SIZE - 1); > + pg_offset = offset_in_page(i_size); > if (page->index > end_index || > (page->index == end_index && !pg_offset)) { > page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE); > @@ -5307,7 +5307,7 @@ void read_extent_buffer(const struct extent_buffer *eb, > void *dstv, > struct page *page; > char *kaddr; > char *dst = (char *)dstv; > - size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1); > + size_t start_offset = offset_in_page(eb->start); > unsigned long i = (start_offset + start) >> PAGE_SHIFT; > > if (start + len > eb->len) { > @@ -5317,7 +5317,7 @@ void read_extent_buffer(const struct extent_buffer *eb, > void *dstv, > return; > } > > - offset = (start_offset + start) & (PAGE_SIZE - 1); > + offset = offset_in_page(start_offset + start); > > while (len > 0) { > page = eb->pages[i]; > @@ -5342,14 +5342,14 @@ int 
read_extent_buffer_to_user(const struct > extent_buffer *eb, > struct page *page; > char *kaddr; > char __user *dst = (char __user *)dstv; >
Re: [PATCH 2/3] btrfs: use PAGE_ALIGNED instead of open-coding it
On 5.12.18 г. 16:23 ч., Johannes Thumshirn wrote: > When using a 'var & (PAGE_SIZE - 1)' construct one is checking for a page > alignment and thus should use the PAGE_ALIGNED() macro instead of > open-coding it. > > Convert all open-coded occurrences of PAGE_ALIGNED(). > > Signed-off-by: Johannes Thumshirn Reviewed-by: Nikolay Borisov > --- > fs/btrfs/check-integrity.c | 8 > fs/btrfs/compression.c | 2 +- > fs/btrfs/inode.c | 2 +- > 3 files changed, 6 insertions(+), 6 deletions(-) > > diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c > index d319c3020c09..84e9729badaa 100644 > --- a/fs/btrfs/check-integrity.c > +++ b/fs/btrfs/check-integrity.c > @@ -1601,7 +1601,7 @@ static int btrfsic_read_block(struct btrfsic_state > *state, > BUG_ON(block_ctx->datav); > BUG_ON(block_ctx->pagev); > BUG_ON(block_ctx->mem_to_free); > - if (block_ctx->dev_bytenr & ((u64)PAGE_SIZE - 1)) { > + if (!PAGE_ALIGNED(block_ctx->dev_bytenr)) { > pr_info("btrfsic: read_block() with unaligned bytenr %llu\n", > block_ctx->dev_bytenr); > return -1; > @@ -1778,7 +1778,7 @@ static void btrfsic_process_written_block(struct > btrfsic_dev_state *dev_state, > return; > } > is_metadata = 1; > - BUG_ON(BTRFS_SUPER_INFO_SIZE & (PAGE_SIZE - 1)); > + BUG_ON(!PAGE_ALIGNED(BTRFS_SUPER_INFO_SIZE)); > processed_len = BTRFS_SUPER_INFO_SIZE; > if (state->print_mask & > BTRFSIC_PRINT_MASK_TREE_BEFORE_SB_WRITE) { > @@ -2892,12 +2892,12 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info, > struct list_head *dev_head = _devices->devices; > struct btrfs_device *device; > > - if (fs_info->nodesize & ((u64)PAGE_SIZE - 1)) { > + if (!PAGE_ALIGNED(fs_info->nodesize)) { > pr_info("btrfsic: cannot handle nodesize %d not being a > multiple of PAGE_SIZE %ld!\n", > fs_info->nodesize, PAGE_SIZE); > return -1; > } > - if (fs_info->sectorsize & ((u64)PAGE_SIZE - 1)) { > + if (!PAGE_ALIGNED(fs_info->sectorsize)) { > pr_info("btrfsic: cannot handle sectorsize %d not being a > multiple of PAGE_SIZE %ld!\n", > 
fs_info->sectorsize, PAGE_SIZE); > return -1; > diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c > index 2ab5591449f2..d5381f39a9e8 100644 > --- a/fs/btrfs/compression.c > +++ b/fs/btrfs/compression.c > @@ -301,7 +301,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode > *inode, u64 start, > blk_status_t ret; > int skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM; > > - WARN_ON(start & ((u64)PAGE_SIZE - 1)); > + WARN_ON(!PAGE_ALIGNED(start)); > cb = kmalloc(compressed_bio_size(fs_info, compressed_len), GFP_NOFS); > if (!cb) > return BLK_STS_RESOURCE; > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index bc0564c384de..5c52e91b01e8 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -2027,7 +2027,7 @@ int btrfs_set_extent_delalloc(struct inode *inode, u64 > start, u64 end, > unsigned int extra_bits, > struct extent_state **cached_state, int dedupe) > { > - WARN_ON((end & (PAGE_SIZE - 1)) == 0); > + WARN_ON(PAGE_ALIGNED(end)); > return set_extent_delalloc(_I(inode)->io_tree, start, end, > extra_bits, cached_state); > } >
[PATCH 1/3] btrfs: use offset_in_page instead of open-coding it
Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an offset into a page. So replace them by the offset_in_page() macro instead of open-coding it if they're not used as an alignment check. Signed-off-by: Johannes Thumshirn --- fs/btrfs/check-integrity.c | 12 +-- fs/btrfs/compression.c | 2 +- fs/btrfs/extent_io.c | 53 +- fs/btrfs/file.c| 4 ++-- fs/btrfs/inode.c | 7 +++--- fs/btrfs/send.c| 2 +- fs/btrfs/volumes.c | 2 +- 7 files changed, 38 insertions(+), 44 deletions(-) diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c index 781cae168d2a..d319c3020c09 100644 --- a/fs/btrfs/check-integrity.c +++ b/fs/btrfs/check-integrity.c @@ -1202,24 +1202,24 @@ static void btrfsic_read_from_block_data( void *dstv, u32 offset, size_t len) { size_t cur; - size_t offset_in_page; + size_t pgoff; char *kaddr; char *dst = (char *)dstv; - size_t start_offset = block_ctx->start & ((u64)PAGE_SIZE - 1); + size_t start_offset = offset_in_page(block_ctx->start); unsigned long i = (start_offset + offset) >> PAGE_SHIFT; WARN_ON(offset + len > block_ctx->len); - offset_in_page = (start_offset + offset) & (PAGE_SIZE - 1); + pgoff = offset_in_page(start_offset + offset); while (len > 0) { - cur = min(len, ((size_t)PAGE_SIZE - offset_in_page)); + cur = min(len, ((size_t)PAGE_SIZE - pgoff)); BUG_ON(i >= DIV_ROUND_UP(block_ctx->len, PAGE_SIZE)); kaddr = block_ctx->datav[i]; - memcpy(dst, kaddr + offset_in_page, cur); + memcpy(dst, kaddr + pgoff, cur); dst += cur; len -= cur; - offset_in_page = 0; + pgoff = 0; i++; } } diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index dba59ae914b8..2ab5591449f2 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -477,7 +477,7 @@ static noinline int add_ra_bio_pages(struct inode *inode, if (page->index == end_index) { char *userpage; - size_t zero_offset = isize & (PAGE_SIZE - 1); + size_t zero_offset = offset_in_page(isize); if (zero_offset) { int zeros; diff --git a/fs/btrfs/extent_io.c 
b/fs/btrfs/extent_io.c index b2769e92b556..e365c5272e6b 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2585,7 +2585,7 @@ static void end_bio_extent_readpage(struct bio *bio) unsigned off; /* Zero out the end if this page straddles i_size */ - off = i_size & (PAGE_SIZE-1); + off = offset_in_page(i_size); if (page->index == end_index && off) zero_user_segment(page, off, PAGE_SIZE); SetPageUptodate(page); @@ -2888,7 +2888,7 @@ static int __do_readpage(struct extent_io_tree *tree, if (page->index == last_byte >> PAGE_SHIFT) { char *userpage; - size_t zero_offset = last_byte & (PAGE_SIZE - 1); + size_t zero_offset = offset_in_page(last_byte); if (zero_offset) { iosize = PAGE_SIZE - zero_offset; @@ -3432,7 +3432,7 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc, ClearPageError(page); - pg_offset = i_size & (PAGE_SIZE - 1); + pg_offset = offset_in_page(i_size); if (page->index > end_index || (page->index == end_index && !pg_offset)) { page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE); @@ -5307,7 +5307,7 @@ void read_extent_buffer(const struct extent_buffer *eb, void *dstv, struct page *page; char *kaddr; char *dst = (char *)dstv; - size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1); + size_t start_offset = offset_in_page(eb->start); unsigned long i = (start_offset + start) >> PAGE_SHIFT; if (start + len > eb->len) { @@ -5317,7 +5317,7 @@ void read_extent_buffer(const struct extent_buffer *eb, void *dstv, return; } - offset = (start_offset + start) & (PAGE_SIZE - 1); + offset = offset_in_page(start_offset + start); while (len > 0) { page = eb->pages[i]; @@ -5342,14 +5342,14 @@ int read_extent_buffer_to_user(const struct extent_buffer *eb, struct page *page; char *kaddr; char __user *dst = (char __user *)dstv; - size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1); + size_t start_offset = offset_in_page(eb->start); unsigned long i = (start_offset + start) >> PAGE_SHIFT; int ret
[RFC PATCH 3/3] coccinelle: api: add offset_in_page.cocci
Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can be replaced by the offset_in_page() macro instead of open-coding it. Add a coccinelle semantic patch to ease detection and conversion of these. This unfortunately doesn't account for the case when we want PAGE_ALIGNED() instead of offset_in_page() yet. Cc: Julia Lawall Signed-off-by: Johannes Thumshirn --- scripts/coccinelle/api/offset_in_page.cocci | 81 + 1 file changed, 81 insertions(+) create mode 100644 scripts/coccinelle/api/offset_in_page.cocci diff --git a/scripts/coccinelle/api/offset_in_page.cocci b/scripts/coccinelle/api/offset_in_page.cocci new file mode 100644 index ..ea5b3a8e0390 --- /dev/null +++ b/scripts/coccinelle/api/offset_in_page.cocci @@ -0,0 +1,81 @@ +// SPDX-License-Identifier: GPL-2.0 +/// +/// Use offset_in_page macro on address instead of explicit computation. +/// +// Confidence: High +// Keywords: offset_in_page +// Comment: Based on vma_pages.cocci + +virtual context +virtual patch +virtual org +virtual report + + +//-- +// For context mode +//-- + +@r_context depends on context && !patch && !org && !report@ +expression E; +@@ + +( +* E & ~PAGE_MASK +| +* E & (PAGE_SIZE - 1) +) + + +//-- +// For patch mode +//-- + +@r_patch depends on !context && patch && !org && !report@ +expression E; +type T; +@@ + +( +- E & ~PAGE_MASK ++ offset_in_page(E) +| +- E & (PAGE_SIZE - 1) ++ offset_in_page(E) +| +- E & ((T)PAGE_SIZE - 1) ++ offset_in_page(E) +) + +//-- +// For org mode +//-- + +@r_org depends on !context && !patch && (org || report)@ +expression E; +position p; +@@ + + ( + * E@p & ~PAGE_MASK + | + * E@p & (PAGE_SIZE - 1) + ) + +@script:python depends on report@ +p << r_org.p; +x << r_org.E; +@@ + +msg="WARNING: Consider using offset_in_page helper on %s" % (x) +coccilib.report.print_report(p[0], msg) + +@script:python depends on org@ +p << r_org.p; +x << r_org.E; +@@ + +msg="WARNING: Consider using offset_in_page helper on %s" % (x) 
+msg_safe=msg.replace("[","@(").replace("]",")") +coccilib.org.print_todo(p[0], msg_safe) + -- 2.16.4
[PATCH 0/3] btrfs: use offset_in_page and PAGE_ALIGNED
Use the offset_in_page() and PAGE_ALIGNED() macros instead of open-coding them throughout btrfs. This series also includes a patch for 'make coccicheck' which is marked as an RFC and I've CCed Julia in the hoping to get input from her. Johannes Thumshirn (3): btrfs: use offset_in_page instead of open-coding it btrfs: use PAGE_ALIGNED instead of open-coding it coccinelle: api: add offset_in_page.cocci fs/btrfs/check-integrity.c | 20 +++ fs/btrfs/compression.c | 4 +- fs/btrfs/extent_io.c| 53 +-- fs/btrfs/file.c | 4 +- fs/btrfs/inode.c| 9 ++-- fs/btrfs/send.c | 2 +- fs/btrfs/volumes.c | 2 +- scripts/coccinelle/api/offset_in_page.cocci | 81 + 8 files changed, 125 insertions(+), 50 deletions(-) create mode 100644 scripts/coccinelle/api/offset_in_page.cocci -- 2.16.4
[PATCH 2/3] btrfs: use PAGE_ALIGNED instead of open-coding it
When using a 'var & (PAGE_SIZE - 1)' construct one is checking for a page alignment and thus should use the PAGE_ALIGNED() macro instead of open-coding it. Convert all open-coded occurrences of PAGE_ALIGNED(). Signed-off-by: Johannes Thumshirn --- fs/btrfs/check-integrity.c | 8 fs/btrfs/compression.c | 2 +- fs/btrfs/inode.c | 2 +- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c index d319c3020c09..84e9729badaa 100644 --- a/fs/btrfs/check-integrity.c +++ b/fs/btrfs/check-integrity.c @@ -1601,7 +1601,7 @@ static int btrfsic_read_block(struct btrfsic_state *state, BUG_ON(block_ctx->datav); BUG_ON(block_ctx->pagev); BUG_ON(block_ctx->mem_to_free); - if (block_ctx->dev_bytenr & ((u64)PAGE_SIZE - 1)) { + if (!PAGE_ALIGNED(block_ctx->dev_bytenr)) { pr_info("btrfsic: read_block() with unaligned bytenr %llu\n", block_ctx->dev_bytenr); return -1; @@ -1778,7 +1778,7 @@ static void btrfsic_process_written_block(struct btrfsic_dev_state *dev_state, return; } is_metadata = 1; - BUG_ON(BTRFS_SUPER_INFO_SIZE & (PAGE_SIZE - 1)); + BUG_ON(!PAGE_ALIGNED(BTRFS_SUPER_INFO_SIZE)); processed_len = BTRFS_SUPER_INFO_SIZE; if (state->print_mask & BTRFSIC_PRINT_MASK_TREE_BEFORE_SB_WRITE) { @@ -2892,12 +2892,12 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info, struct list_head *dev_head = _devices->devices; struct btrfs_device *device; - if (fs_info->nodesize & ((u64)PAGE_SIZE - 1)) { + if (!PAGE_ALIGNED(fs_info->nodesize)) { pr_info("btrfsic: cannot handle nodesize %d not being a multiple of PAGE_SIZE %ld!\n", fs_info->nodesize, PAGE_SIZE); return -1; } - if (fs_info->sectorsize & ((u64)PAGE_SIZE - 1)) { + if (!PAGE_ALIGNED(fs_info->sectorsize)) { pr_info("btrfsic: cannot handle sectorsize %d not being a multiple of PAGE_SIZE %ld!\n", fs_info->sectorsize, PAGE_SIZE); return -1; diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 2ab5591449f2..d5381f39a9e8 100644 --- a/fs/btrfs/compression.c +++ 
b/fs/btrfs/compression.c @@ -301,7 +301,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start, blk_status_t ret; int skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM; - WARN_ON(start & ((u64)PAGE_SIZE - 1)); + WARN_ON(!PAGE_ALIGNED(start)); cb = kmalloc(compressed_bio_size(fs_info, compressed_len), GFP_NOFS); if (!cb) return BLK_STS_RESOURCE; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index bc0564c384de..5c52e91b01e8 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2027,7 +2027,7 @@ int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end, unsigned int extra_bits, struct extent_state **cached_state, int dedupe) { - WARN_ON((end & (PAGE_SIZE - 1)) == 0); + WARN_ON(PAGE_ALIGNED(end)); return set_extent_delalloc(_I(inode)->io_tree, start, end, extra_bits, cached_state); } -- 2.16.4
Btrfs progs release 4.19.1
Hi,

btrfs-progs version 4.19.1 has been released. There are build fixes, minor updates to libbtrfsutil and documentation updates.

Changes since 4.19.1-rc1:
  * fix typos in CHANGES

Changes:
  * build fixes
    * big-endian builds fail due to bswap helper clashes
    * 'swap' macro is too generic, renamed to prevent build failures
  * libbtrfsutil
    * minor version update to 1.1.0
    * fix default search to top=0 as documented
    * rename 'async' to avoid future python binding problems
    * add support for unprivileged subvolume listing ioctls
    * added tests, API docs
  * other
    * lots of typos fixed
    * warning cleanups
    * doc formatting updates
    * CI tests against zstd 1.3.7

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

David Sterba (7):
      btrfs-progs: kerncompat: rename swap to __swap
      btrfs-progs: README: add link to INSTALL
      btrfs-progs: docs: fix rendering of exponents in manual pages
      btrfs-progs: link to libbtrfsutil/README from the main README
      btrfs-progs: tests: pull zstd version 1.3.7 to the travis CI
      btrfs-progs: update CHANGES for v4.19.1
      Btrfs progs v4.19.1

Josh Soref (11):
      btrfs-progs: docs: fix typos in Documentation
      btrfs-progs: docs: fix typos in READMEs, INSTALL and CHANGES
      btrfs-progs: fix typos in Makefile
      btrfs-progs: tests: fix typos in test comments
      btrfs-progs: tests: fsck/025, fix typo in helper name
      btrfs-progs: fix typos in comments
      btrfs-progs: fix typos in user-visible strings
      btrfs-progs: check: fix typo in device_extent_record::chunk_objectid
      btrfs-progs: utils: fix typo in a variable
      btrfs-progs: mkfs: fix typo in "multipler" variables
      btrfs-progs: fix typo in btrfs-list function export

Omar Sandoval (10):
      libbtrfsutil: use top=0 as default for SubvolumeIterator()
      libbtrfsutil: change async parameters to async_ in Python bindings
      libbtrfsutil: document qgroup_inherit parameter in Python bindings
      libbtrfsutil: use SubvolumeIterator as context manager in tests
      libbtrfsutil: add test helpers for dropping privileges
      libbtrfsutil: allow tests to create multiple Btrfs instances
      libbtrfsutil: relax the privileges of subvolume_info()
      libbtrfsutil: relax the privileges of subvolume iterator
      libbtrfsutil: bump version to 1.1.0
      libbtrfsutil: document API in README

Rosen Penev (3):
      btrfs-progs: kernel-lib: bitops: Fix big endian compilation
      btrfs-progs: task-utils: Fix comparison between pointer and integer
      btrfs-progs: treewide: Fix missing declarations
Re: [PATCH v2 07/13] btrfs-progs: Fix Wmaybe-uninitialized warning
On 2018/12/5 9:40 PM, David Sterba wrote:
> On Wed, Dec 05, 2018 at 02:40:12PM +0800, Qu Wenruo wrote:
>> GCC 8.2.1 will report the following warning with "make W=1":
>>
>> ctree.c: In function 'btrfs_next_sibling_tree_block':
>> ctree.c:2990:21: warning: 'slot' may be used uninitialized in this
>> function [-Wmaybe-uninitialized]
>>   path->slots[level] = slot;
>>   ~~~^~
>>
>> The culprit is the following code:
>>
>>   int slot;   << Not initialized
>>   int level = path->lowest_level + 1;
>>   BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
>>   while (level < BTRFS_MAX_LEVEL) {
>>           slot = path->slots[level] + 1;
>>           ^^ but we initialize @slot here.
>>           ...
>>   }
>>   path->slots[level] = slot;
>>
>> It's possible that the compiler doesn't get enough of a hint from BUG_ON()
>> for the lowest_level + 1 >= BTRFS_MAX_LEVEL case.
>>
>> Fix it by using a do {} while() loop instead of a while() {} loop, to
>> ensure we run the loop at least once.
>
> I was hoping that we can actually add the hint to BUG_ON so the code
> does not continue if the condition is true.

I checked that method, but I'm not that confident about things like:

  bugon_trace()
  {
          if (!val)
                  return;
          __bugon_trace();
  }

  __attribute__ ((noreturn)) static inline void __bugon_trace();

This is as simple as one extra function call, but the original problem
is also just one more function call before hitting abort().

So I gave up tweaking things I'm not comfortable enough with.

The current do {} while() loop is the most direct solution; if gcc one
day still emits such a warning, I'll have some harsh words for it then.

Thanks,
Qu
Re: [PATCH 07/10] dax: export functions for use with btrfs
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
[...]
> -static void *grab_mapping_entry(struct xa_state *xas,
> +void *grab_mapping_entry(struct xa_state *xas,
> 		struct address_space *mapping, unsigned long size_flag)
> {
> 	unsigned long index = xas->xa_index;
> @@ -531,6 +532,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
> 	xas_unlock_irq(xas);
> 	return xa_mk_internal(VM_FAULT_FALLBACK);
> }
> +EXPORT_SYMBOL(grab_mapping_entry);

dax_grab_mapping_entry() please.

--
Johannes Thumshirn
SUSE Labs Filesystems
jthumsh...@suse.de
+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 00/10] btrfs: Support for DAX devices
On Wed, Dec 05, 2018 at 06:28:25AM -0600, Goldwyn Rodrigues wrote: > This is a support for DAX in btrfs. Yay! > I understand there have been previous attempts at it. However, I wanted > to make sure copy-on-write (COW) works on dax as well. btrfs' usual use of CoW and DAX are thoroughly in conflict. The very point of DAX is to have writes not go through the kernel, you mmap the file then do all writes right to the pmem, flushing when needed (without hitting the kernel) and having the processor+memory persist what you wrote. CoW via page faults are fine -- pmem is closer to memory than disk, and this means the kernel will ask the filesystem for an extent to place the new page in, copy the contents and let the process play with it. But real btrfs CoW would mean we'd need to page fault on ᴇᴠᴇʀʏ ꜱɪɴɢʟᴇ ᴡʀɪᴛᴇ. Delaying CoW until the next commit doesn't help -- you'd need to store the dirty page in DRAM then write it, which goes against the whole concept of DAX. Only way I see would be to CoW once then pretend the page is nodatacow until the next commit, when we checksum it, add to the metadata trees, and mark for CoWing on the next write. Lots of complexity, and you still need to copy the whole thing every commit (so no gain). Ie, we're in nodatacow land. CoW for metadata is fine. > Before I present this to the FS folks I wanted to run this through the > btrfs. Even though I wish, I cannot get it correct the first time > around :/.. Here are some questions for which I need suggestions: > > Questions: > 1. I have been unable to do checksumming for DAX devices. While > checksumming can be done for reads and writes, it is a problem when mmap > is involved because btrfs kernel module does not get back control after > an mmap() writes. Any ideas are appreciated, or we would have to set > nodatasum when dax is enabled. Per the above, it sounds like nodatacow (ie, "cow once") would be needed. > 2. 
Currently, a user can continue writing on "old" extents of an mmaped file > after a snapshot has been created. How can we enforce writes to be directed > to new extents after snapshots have been created? Do we keep a list of > all mmap()s, and re-mmap them after a snapshot? Same as for any other memory that's shared: when a new instance of sharing is added (a snapshot/reflink in our case), you deny writes, causing a page fault on the next attempt. "pmem" is named "ᴘersistent ᴍᴇᴍory" for a reason... > Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel > command line parameter. Might be more useful to use a bigger piece of the "disk" than 2G, it's not in the danger area though. Also note that it's utterly pointless to use any RAID modes; multi-dev single is fine, DUP counts as RAID here. * RAID0 is already done better in hardware (interleave) * RAID1 would require hardware support, replication isn't easy * RAID5/6 What would make sense, is disabling dax for any files that are not marked as nodatacow. This way, unrelated files can still use checksums or compression, while only files meant as a pmempool or otherwise by a pmem-aware program would have dax writes (you can still give read-only pages that CoW to DRAM). This way we can have write dax for only a subset of files, and full set of btrfs features for the rest. Write dax is dangerous for programs that have no specific support: the vast majority of database-like programs rely on page-level atomicity while pmem gives you cacheline/word atomicity only; torn writes mean data loss. Meow! -- ⢀⣴⠾⠻⢶⣦⠀ ⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in ⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned ⠈⠳⣄ to the city of his birth to die.
Re: [PATCH 06/10] btrfs: dax write support
On 05/12/2018 13:28, Goldwyn Rodrigues wrote: [...] > +static int copy_extent_page(struct extent_map *em, void *daddr, u64 pos) > +{ > +struct dax_device *dax_dev; ^ space instead of tabs? > + void *saddr; > + sector_t start; > + size_t len; > + > + if (em->block_start == EXTENT_MAP_HOLE) { > + memset(daddr, 0, PAGE_SIZE); > + } else { > + dax_dev = fs_dax_get_by_bdev(em->bdev); > + start = (get_start_sect(em->bdev) << 9) + (em->block_start + > (pos - em->start)); > + len = dax_direct_access(dax_dev, PHYS_PFN(start), 1, , > NULL); > + memcpy(daddr, saddr, PAGE_SIZE); > + } > + free_extent_map(em); > + > + return 0; > +} > + > + copy_extent_page() always returns 0, why not make it void? Plus a nit: double newline. > +ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from) > +{ > + ssize_t ret, done = 0, count = iov_iter_count(from); > +struct inode *inode = file_inode(iocb->ki_filp); ^ again spaces vs tabs. > + u64 pos = iocb->ki_pos; > + u64 start = round_down(pos, PAGE_SIZE); > + u64 end = round_up(pos + count, PAGE_SIZE); > + struct extent_state *cached_state = NULL; > + struct extent_changeset *data_reserved = NULL; > + struct extent_map *first = NULL, *last = NULL; > + > + ret = btrfs_delalloc_reserve_space(inode, _reserved, start, end - > start); > + if (ret < 0) > + return ret; > + > + /* Grab a reference of the first extent to copy data */ > + if (start < pos) { > + first = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, end - > start, 0); > + if (IS_ERR(first)) { > + ret = PTR_ERR(first); > + goto out2; > + } > + } You're using 'end - start' at least twice here, maybe you could move 'len' out of the loop and use it for btrfs_delalloc_reserve_space() and btrfs_get_extent() as well. 
> + > + /* Grab a reference of the last extent to copy data */ > + if (pos + count < end) { > + last = btrfs_get_extent(BTRFS_I(inode), NULL, 0, end - > PAGE_SIZE, PAGE_SIZE, 0); > + if (IS_ERR(last)) { > + ret = PTR_ERR(last); > + goto out2; > + } > + } > + > + lock_extent_bits(_I(inode)->io_tree, start, end, _state); > + while (done < count) { > + struct extent_map *em; > + struct dax_device *dax_dev; > + int offset = pos & (PAGE_SIZE - 1); > + u64 estart = round_down(pos, PAGE_SIZE); > + u64 elen = end - estart; > + size_t len = count - done; > + sector_t dstart; > + void *daddr; > + ssize_t maplen; > + > + /* Read the current extent */ > +em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, estart, elen, > 0); Space again. > + if (IS_ERR(em)) { > + ret = PTR_ERR(em); > + goto out; > + } > + > + /* Get a new extent */ > + ret = btrfs_get_extent_map_write(, NULL, inode, estart, > elen); > + if (ret < 0) > + goto out; > + > + dax_dev = fs_dax_get_by_bdev(em->bdev); > + /* Calculate start address start of destination extent */ > + dstart = (get_start_sect(em->bdev) << 9) + em->block_start; > + maplen = dax_direct_access(dax_dev, PHYS_PFN(dstart), > + PHYS_PFN(em->len), , NULL); > + > + /* Copy front of extent page */ > + if (offset) > + ret = copy_extent_page(first, daddr, estart); > + > + /* Copy end of extent page */ > + if ((pos + len > estart + PAGE_SIZE) && (pos + len < em->start > + em->len)) > + ret = copy_extent_page(last, daddr + em->len - > PAGE_SIZE, em->start + em->len - PAGE_SIZE); > + > + /* Copy the data from the iter */ > + maplen = PFN_PHYS(maplen); > + maplen -= offset; > + ret = dax_copy_from_iter(dax_dev, dstart, daddr + offset, > maplen, from); > + if (ret < 0) > + goto out; > + pos += ret; > + done += ret; > + } > +out: out_unlock? 
> +	unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, end,
> +			&cached_state);
> +	if (done) {
> +		btrfs_update_ordered_extent(inode, start,
> +				end - start, true);
> +		iocb->ki_pos += done;
> +		if (iocb->ki_pos > i_size_read(inode))
> +			i_size_write(inode, iocb->ki_pos);
> +	}
> +
> +	btrfs_delalloc_release_extents(BTRFS_I(inode), count, false);
> +out2:

out?

> +	if (count - done > 0)
> +		btrfs_delalloc_release_space(inode, data_reserved, pos,
> +				count
Re: [PATCH v2 07/13] btrfs-progs: Fix Wmaybe-uninitialized warning
On Wed, Dec 05, 2018 at 02:40:12PM +0800, Qu Wenruo wrote: > GCC 8.2.1 will report the following warning with "make W=1": > > ctree.c: In function 'btrfs_next_sibling_tree_block': > ctree.c:2990:21: warning: 'slot' may be used uninitialized in this function > [-Wmaybe-uninitialized] > path->slots[level] = slot; > ~~~^~ > > The culprit is the following code: > > int slot; << Not initialized > int level = path->lowest_level + 1; > BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL); > while(level < BTRFS_MAX_LEVEL) { > slot = path->slots[level] + 1; > ^^ but we initialize @slot here. > ... > } > path->slots[level] = slot; > > It's possible that compiler doesn't get enough hint for BUG_ON() on > lowest_level + 1 >= BTRFS_MAX_LEVEL case. > > Fix it by using a do {} while() loop other than while() {} loop, to > ensure we will run the loop for at least once. I was hoping that we can actually add the hint to BUG_ON so the code does not continue if the condition is true.
Re: [PATCH 04/10] Rename __endio_write_update_ordered() to btrfs_update_ordered_extent()
On 5.12.18 г. 14:28 ч., Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues > > Since we will be using it in another part of the code, use a > better name to declare it non-static > > Signed-off-by: Goldwyn Rodrigues > --- > fs/btrfs/ctree.h | 7 +-- > fs/btrfs/inode.c | 14 +- > 2 files changed, 10 insertions(+), 11 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 038d64ecebe5..5144d28216b0 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -3170,8 +3170,11 @@ struct inode *btrfs_iget_path(struct super_block *s, > struct btrfs_key *location, > struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location, >struct btrfs_root *root, int *was_new); > struct extent_map *btrfs_get_extent(struct btrfs_inode *inode, > - struct page *page, size_t pg_offset, > - u64 start, u64 end, int create); > + struct page *page, size_t pg_offset, > + u64 start, u64 end, int create); > +void btrfs_update_ordered_extent(struct inode *inode, > + const u64 offset, const u64 bytes, > + const bool uptodate); > int btrfs_update_inode(struct btrfs_trans_handle *trans, > struct btrfs_root *root, > struct inode *inode); > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 9ea4c6f0352f..96e9fe9e4150 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -97,10 +97,6 @@ static struct extent_map *create_io_em(struct inode > *inode, u64 start, u64 len, > u64 ram_bytes, int compress_type, > int type); > > -static void __endio_write_update_ordered(struct inode *inode, > - const u64 offset, const u64 bytes, > - const bool uptodate); > - > /* > * Cleanup all submitted ordered extents in specified range to handle errors > * from the fill_dellaloc() callback. 
> @@ -130,7 +126,7 @@ static inline void btrfs_cleanup_ordered_extents(struct > inode *inode, > ClearPagePrivate2(page); > put_page(page); > } > - return __endio_write_update_ordered(inode, offset + PAGE_SIZE, > + return btrfs_update_ordered_extent(inode, offset + PAGE_SIZE, > bytes - PAGE_SIZE, false); > } > > @@ -8059,7 +8055,7 @@ static void btrfs_endio_direct_read(struct bio *bio) > bio_put(bio); > } > > -static void __endio_write_update_ordered(struct inode *inode, > +void btrfs_update_ordered_extent(struct inode *inode, >const u64 offset, const u64 bytes, >const bool uptodate) Since you are exporting the function I'd suggest to use the occasion to introduce proper kernel-doc. THe primary help will be if you document the context under which the function is called - when writes are finished and it's used to update respective portion of the ordered extent. > { > @@ -8112,7 +8108,7 @@ static void btrfs_endio_direct_write(struct bio *bio) > struct btrfs_dio_private *dip = bio->bi_private; > struct bio *dio_bio = dip->dio_bio; > > - __endio_write_update_ordered(dip->inode, dip->logical_offset, > + btrfs_update_ordered_extent(dip->inode, dip->logical_offset, >dip->bytes, !bio->bi_status); > > kfree(dip); > @@ -8432,7 +8428,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, > struct inode *inode, > bio = NULL; > } else { > if (write) > - __endio_write_update_ordered(inode, > + btrfs_update_ordered_extent(inode, > file_offset, > dio_bio->bi_iter.bi_size, > false); > @@ -8572,7 +8568,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, > struct iov_iter *iter) >*/ > if (dio_data.unsubmitted_oe_range_start < > dio_data.unsubmitted_oe_range_end) > - __endio_write_update_ordered(inode, > + btrfs_update_ordered_extent(inode, > dio_data.unsubmitted_oe_range_start, > dio_data.unsubmitted_oe_range_end - > dio_data.unsubmitted_oe_range_start, >
Re: [PATCH 03/10] btrfs: dax: read zeros from holes
On 5.12.18 г. 14:28 ч., Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues > > Signed-off-by: Goldwyn Rodrigues > --- > fs/btrfs/dax.c | 7 ++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c > index d614bf73bf8e..5a297674adec 100644 > --- a/fs/btrfs/dax.c > +++ b/fs/btrfs/dax.c > @@ -54,7 +54,12 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct > iov_iter *to) nit: I think it's better if you rename the iterator variable to "iter". > > BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING); > > -ret = em_dax_rw(inode, em, pos, len, to); > + if (em->block_start == EXTENT_MAP_HOLE) { > + u64 zero_len = min(em->len - (em->start - pos), len); Shouldn't this be em->len - (pos - em->start) since this gives the remaining length of the extent? Isn't pos guaranteed to be >= em->start ? > + ret = iov_iter_zero(zero_len, to); > + } else { > + ret = em_dax_rw(inode, em, pos, len, to); > + } > if (ret < 0) > goto out; > pos += ret; >
Re: [PATCH 02/10] btrfs: basic dax read
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues
>
> Signed-off-by: Goldwyn Rodrigues

Can you explain why we can't use the dax_iomap_rw() interface like XFS or EXT4?

[...]
> +static ssize_t em_dax_rw(struct inode *inode, struct extent_map *em, u64 pos,
> +		u64 len, struct iov_iter *iter)
> +{
> +	struct dax_device *dax_dev = fs_dax_get_by_bdev(em->bdev);
> +	ssize_t map_len;
> +	pgoff_t blk_pg;
> +	void *kaddr;
> +	sector_t blk_start;
> +	unsigned offset = pos & (PAGE_SIZE - 1);

Nit:
	unsigned offset = offset_in_page(pos);

--
Johannes Thumshirn
SUSE Labs Filesystems
jthumsh...@suse.de
+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 02/10] btrfs: basic dax read
On 5.12.18 г. 14:28 ч., Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues > > Signed-off-by: Goldwyn Rodrigues > --- > fs/btrfs/Makefile | 1 + > fs/btrfs/ctree.h | 5 > fs/btrfs/dax.c| 68 > +++ > fs/btrfs/file.c | 13 ++- > 4 files changed, 86 insertions(+), 1 deletion(-) > create mode 100644 fs/btrfs/dax.c > > diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile > index ca693dd554e9..1fa77b875ae9 100644 > --- a/fs/btrfs/Makefile > +++ b/fs/btrfs/Makefile > @@ -12,6 +12,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o > root-tree.o dir-item.o \ > reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \ > uuid-tree.o props.o free-space-tree.o tree-checker.o > > +btrfs-$(CONFIG_FS_DAX) += dax.o > btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o > btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o > btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 5cc470fa6a40..038d64ecebe5 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -3685,6 +3685,11 @@ int btrfs_reada_wait(void *handle); > void btrfs_reada_detach(void *handle); > int btree_readahead_hook(struct extent_buffer *eb, int err); > > +#ifdef CONFIG_FS_DAX > +/* dax.c */ > +ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to); > +#endif /* CONFIG_FS_DAX */ > + > static inline int is_fstree(u64 rootid) > { > if (rootid == BTRFS_FS_TREE_OBJECTID || > diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c > new file mode 100644 > index ..d614bf73bf8e > --- /dev/null > +++ b/fs/btrfs/dax.c > @@ -0,0 +1,68 @@ > +#include > +#include > +#include "ctree.h" > +#include "btrfs_inode.h" > + > +static ssize_t em_dax_rw(struct inode *inode, struct extent_map *em, u64 pos, > + u64 len, struct iov_iter *iter) > +{ > +struct dax_device *dax_dev = fs_dax_get_by_bdev(em->bdev); > +ssize_t map_len; > +pgoff_t blk_pg; > +void *kaddr; > +sector_t blk_start; > +unsigned offset = pos & (PAGE_SIZE - 1); offset = offset_in_page(pos) > + > 
> +	len = min(len + offset, em->len - (pos - em->start));
> +	len = ALIGN(len, PAGE_SIZE);

len = PAGE_ALIGN(len);

> +	blk_start = (get_start_sect(em->bdev) << 9) + (em->block_start +
> +			(pos - em->start));
> +	blk_pg = blk_start - offset;
> +	map_len = dax_direct_access(dax_dev, PHYS_PFN(blk_pg),
> +			PHYS_PFN(len), &kaddr, NULL);
> +	map_len = PFN_PHYS(map_len);
> +	kaddr += offset;
> +	map_len -= offset;
> +	if (map_len > len)
> +		map_len = len;

map_len = min(map_len, len);

> +	if (iov_iter_rw(iter) == WRITE)
> +		return dax_copy_from_iter(dax_dev, blk_pg, kaddr, map_len,
> +				iter);
> +	else
> +		return dax_copy_to_iter(dax_dev, blk_pg, kaddr, map_len,
> +				iter);

Have you looked at the implementation of dax_iomap_actor, where they have pretty similar code? In case either of those returns 0 they set ret to EFAULT; should the same be done in btrfs_file_dax_read? IMO it would be good if you can follow dax_iomap_actor's logic as much as possible, since this code has been used for quite some time and is deemed robust.

> +}
> +
> +ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	size_t ret = 0, done = 0, count = iov_iter_count(to);
> +	struct extent_map *em;
> +	u64 pos = iocb->ki_pos;
> +	u64 end = pos + count;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	if (!count)
> +		return 0;
> +
> +	end = i_size_read(inode) < end ? i_size_read(inode) : end;

end = min(i_size_read(inode), end)

> +
> +	while (pos < end) {
> +		u64 len = end - pos;
> +
> +		em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, len, 0);
> +		if (IS_ERR(em)) {
> +			if (!ret)
> +				ret = PTR_ERR(em);
> +			goto out;
> +		}
> +
> +		BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING);

I think this can never trigger, because EXTENT_FLAG_FS_MAPPING is set for extents that map chunks, and those are housed in the chunk tree at fs_info->mapping_tree. Since the write callback is only ever called for file inodes I'd say this BUG_ON can be eliminated. Did you manage to trigger it during development?
> + > +ret = em_dax_rw(inode, em, pos, len, to); > +if (ret < 0) > +goto out; > +pos += ret; > +done += ret; > +} > + > +out: > +iocb->ki_pos += done; > +return done ? done : ret; > +} > + > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c > index 58e93bce3036..ef6ed93f44d1 100644 > --- a/fs/btrfs/file.c > +++ b/fs/btrfs/file.c > @@
Re: [PATCH 00/10] btrfs: Support for DAX devices
On 2018/12/5 8:28 PM, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs. I understand there have been
> previous attempts at it. However, I wanted to make sure copy-on-write
> (COW) works on dax as well.
>
> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:
>
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because btrfs kernel module does not get back control after
> an mmap() writes. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

I'm not familiar with DAX, so it's completely possible I'm talking like
an idiot.

If btrfs_page_mkwrite() can't provide enough control, then I have a
crazy idea: force a page fault for every mmap() read/write (completely
disabling the page cache, like DIO). That way we could get some control,
since we're informed of every page read and can do some hacks there.

Thanks,
Qu

> 2. Currently, a user can continue writing on "old" extents of an mmaped file
> after a snapshot has been created. How can we enforce writes to be directed
> to new extents after snapshots have been created? Do we keep a list of
> all mmap()s, and re-mmap them after a snapshot?
>
> Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
> command line parameter.
>
> [PATCH 01/10] btrfs: create a mount option for dax
> [PATCH 02/10] btrfs: basic dax read
> [PATCH 03/10] btrfs: dax: read zeros from holes
> [PATCH 04/10] Rename __endio_write_update_ordered() to
> [PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
> [PATCH 06/10] btrfs: dax write support
> [PATCH 07/10] dax: export functions for use with btrfs
> [PATCH 08/10] btrfs: dax add read mmap path
> [PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
> [PATCH 10/10] btrfs: dax mmap write
>
>  fs/btrfs/Makefile   |   1
>  fs/btrfs/ctree.h    |  17 ++
>  fs/btrfs/dax.c      | 303 ++--
>  fs/btrfs/file.c     |  29
>  fs/btrfs/inode.c    |  54 +
>  fs/btrfs/ioctl.c    |   5
>  fs/btrfs/super.c    |  15 ++
>  fs/dax.c            |  35 --
>  include/linux/dax.h |  16 ++
>  9 files changed, 430 insertions(+), 45 deletions(-)
Re: [PATCH 01/10] btrfs: create a mount option for dax
On 5.12.18, 14:28, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues
>
> Also, set the inode->i_flags to S_DAX
>
> Signed-off-by: Goldwyn Rodrigues

Reviewed-by: Nikolay Borisov

One question below, though.

> ---
>  fs/btrfs/ctree.h |  1 +
>  fs/btrfs/ioctl.c |  5 ++++-
>  fs/btrfs/super.c | 15 +++++++++++++++
>  3 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 68f322f600a0..5cc470fa6a40 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1353,6 +1353,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
>  #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
>  #define BTRFS_MOUNT_NOLOGREPLAY	(1 << 27)
>  #define BTRFS_MOUNT_REF_VERIFY	(1 << 28)
> +#define BTRFS_MOUNT_DAX	(1 << 29)
>
>  #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
>  #define BTRFS_DEFAULT_MAX_INLINE	(2048)
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 802a628e9f7d..e9146c157816 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -149,8 +149,11 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
>  	if (binode->flags & BTRFS_INODE_DIRSYNC)
>  		new_fl |= S_DIRSYNC;
>
> +	if ((btrfs_test_opt(btrfs_sb(inode->i_sb), DAX)) && S_ISREG(inode->i_mode))
> +		new_fl |= S_DAX;
> +
>  	set_mask_bits(&inode->i_flags,
> -		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
> +		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC | S_DAX,
>  		      new_fl);
>  }
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 645fc81e2a94..035263b61cf5 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -326,6 +326,7 @@ enum {
>  	Opt_treelog, Opt_notreelog,
>  	Opt_usebackuproot,
>  	Opt_user_subvol_rm_allowed,
> +	Opt_dax,
>
>  	/* Deprecated options */
>  	Opt_alloc_start,
> @@ -393,6 +394,7 @@ static const match_table_t tokens = {
>  	{Opt_notreelog, "notreelog"},
>  	{Opt_usebackuproot, "usebackuproot"},
>  	{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
> +	{Opt_dax, "dax"},
>
>  	/* Deprecated options */
>  	{Opt_alloc_start, "alloc_start=%s"},
> @@ -739,6 +741,17 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>  	case Opt_user_subvol_rm_allowed:
>  		btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
>  		break;
> +#ifdef CONFIG_FS_DAX
> +	case Opt_dax:
> +		if (btrfs_super_num_devices(info->super_copy) > 1) {
> +			btrfs_info(info,
> +				   "dax not supported for multi-device btrfs partition\n");

What prevents supporting dax for multiple devices, so long as all the devices are dax?
Re: [PATCH 01/10] btrfs: create a mount option for dax
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues
>
> Also, set the inode->i_flags to S_DAX
>
> Signed-off-by: Goldwyn Rodrigues
> ---
>  fs/btrfs/ctree.h |  1 +
>  fs/btrfs/ioctl.c |  5 ++++-
>  fs/btrfs/super.c | 15 +++++++++++++++
>  3 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 68f322f600a0..5cc470fa6a40 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1353,6 +1353,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
>  #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
>  #define BTRFS_MOUNT_NOLOGREPLAY	(1 << 27)
>  #define BTRFS_MOUNT_REF_VERIFY	(1 << 28)
> +#define BTRFS_MOUNT_DAX	(1 << 29)

Just as a heads-up, this will collide with the patch called '[RFC PATCH
02/17] btrfs: add mount definition BTRFS_MOUNT_PRIORITY_USAGE' from Su
Yue.

[...]

> +#ifdef CONFIG_FS_DAX
> +	case Opt_dax:
> +		if (btrfs_super_num_devices(info->super_copy) > 1) {
> +			btrfs_info(info,
> +				   "dax not supported for multi-device btrfs partition\n");
> +			ret = -EOPNOTSUPP;
> +			goto out;
> +		}
> +		btrfs_set_opt(info->mount_opt, DAX);
> +		break;
> +#endif

Can you please explain why we can't enable DAX on a multi-device FS in
the changelog? It's (for me at least) not obvious.

Thanks,
Johannes
-- 
Johannes Thumshirn                        SUSE Labs Filesystems
jthumsh...@suse.de                        +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
[PATCH 10/10] btrfs: dax mmap write
From: Goldwyn Rodrigues

Create a page size extent and copy the contents of the original extent
into the new one, and present it to user space as the page to write.

Signed-off-by: Goldwyn Rodrigues
---
 fs/btrfs/dax.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 6d68d39cc5da..4634917877f3 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -231,6 +231,45 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
 		sector >>= 9;
 		ret = copy_user_dax(em->bdev, dax_dev, sector, PAGE_SIZE, vmf->cow_page, vaddr);
 		goto out;
+	} else if (vmf->flags & FAULT_FLAG_WRITE) {
+		pfn_t pfn;
+		struct extent_map *orig = em;
+		void *daddr;
+		sector_t dstart;
+		size_t maplen;
+		struct extent_changeset *data_reserved = NULL;
+		struct extent_state *cached_state = NULL;
+
+		ret = btrfs_delalloc_reserve_space(inode, &data_reserved, pos, PAGE_SIZE);
+		if (ret < 0)
+			return ret;
+		refcount_inc(&em->refs);
+		lock_extent_bits(&BTRFS_I(inode)->io_tree, pos, pos + PAGE_SIZE, &cached_state);
+		/* Create an extent of page size */
+		ret = btrfs_get_extent_map_write(&em, NULL, inode, pos,
+				PAGE_SIZE);
+		if (ret < 0) {
+			free_extent_map(orig);
+			btrfs_delalloc_release_space(inode, data_reserved, pos,
+					PAGE_SIZE, true);
+			goto out;
+		}
+
+		dax_dev = fs_dax_get_by_bdev(em->bdev);
+		/* Calculate start address of destination extent */
+		dstart = (get_start_sect(em->bdev) << 9) + em->block_start;
+		maplen = dax_direct_access(dax_dev, PHYS_PFN(dstart),
+				1, &daddr, &pfn);
+
+		/* Copy the original contents into new destination */
+		copy_extent_page(orig, daddr, pos);
+		btrfs_update_ordered_extent(inode, pos, PAGE_SIZE, true);
+		dax_insert_entry(&xas, mapping, vmf, entry, pfn, 0, false);
+		ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
+		free_extent_map(orig);
+		unlock_extent_cached(&BTRFS_I(inode)->io_tree, pos, pos + PAGE_SIZE, &cached_state);
+		extent_changeset_free(data_reserved);
+		btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, false);
 	} else {
 		sector_t sector;
 		if (em->block_start == EXTENT_MAP_HOLE) {
-- 
2.16.4
[PATCH 06/10] btrfs: dax write support
From: Goldwyn Rodrigues

This is a combination of direct and buffered I/O. The similarity with
direct I/O is that it needs to allocate space before writing. The
similarity with buffered I/O is that when the data is not page-aligned,
it needs to copy parts of the previous extents. In order to accomplish
that, keep references to the first and last extent (if required) and
then perform allocations. If "pos" or "end" is not aligned, copy the
data from the first and last extent respectively.

Signed-off-by: Goldwyn Rodrigues
---
 fs/btrfs/ctree.h |   1 +
 fs/btrfs/dax.c   | 121 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/file.c  |   4 +-
 3 files changed, 125 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a0d296b0d826..d91ff283a966 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3693,6 +3693,7 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
 #ifdef CONFIG_FS_DAX
 /* dax.c */
 ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
+ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
 #endif /* CONFIG_FS_DAX */
 
 static inline int is_fstree(u64 rootid)
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 5a297674adec..4000259a426c 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -2,6 +2,7 @@
 #include <linux/dax.h>
 #include "ctree.h"
 #include "btrfs_inode.h"
+#include "extent_io.h"
 
 static ssize_t em_dax_rw(struct inode *inode, struct extent_map *em, u64 pos,
 		u64 len, struct iov_iter *iter)
@@ -71,3 +72,123 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
 	return done ? done : ret;
 }
 
+static int copy_extent_page(struct extent_map *em, void *daddr, u64 pos)
+{
+	struct dax_device *dax_dev;
+	void *saddr;
+	sector_t start;
+	size_t len;
+
+	if (em->block_start == EXTENT_MAP_HOLE) {
+		memset(daddr, 0, PAGE_SIZE);
+	} else {
+		dax_dev = fs_dax_get_by_bdev(em->bdev);
+		start = (get_start_sect(em->bdev) << 9) + (em->block_start + (pos - em->start));
+		len = dax_direct_access(dax_dev, PHYS_PFN(start), 1, &saddr, NULL);
+		memcpy(daddr, saddr, PAGE_SIZE);
+	}
+	free_extent_map(em);
+
+	return 0;
+}
+
+ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	ssize_t ret, done = 0, count = iov_iter_count(from);
+	struct inode *inode = file_inode(iocb->ki_filp);
+	u64 pos = iocb->ki_pos;
+	u64 start = round_down(pos, PAGE_SIZE);
+	u64 end = round_up(pos + count, PAGE_SIZE);
+	struct extent_state *cached_state = NULL;
+	struct extent_changeset *data_reserved = NULL;
+	struct extent_map *first = NULL, *last = NULL;
+
+	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, end - start);
+	if (ret < 0)
+		return ret;
+
+	/* Grab a reference of the first extent to copy data */
+	if (start < pos) {
+		first = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, end - start, 0);
+		if (IS_ERR(first)) {
+			ret = PTR_ERR(first);
+			goto out2;
+		}
+	}
+
+	/* Grab a reference of the last extent to copy data */
+	if (pos + count < end) {
+		last = btrfs_get_extent(BTRFS_I(inode), NULL, 0, end - PAGE_SIZE, PAGE_SIZE, 0);
+		if (IS_ERR(last)) {
+			ret = PTR_ERR(last);
+			goto out2;
+		}
+	}
+
+	lock_extent_bits(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
+	while (done < count) {
+		struct extent_map *em;
+		struct dax_device *dax_dev;
+		int offset = pos & (PAGE_SIZE - 1);
+		u64 estart = round_down(pos, PAGE_SIZE);
+		u64 elen = end - estart;
+		size_t len = count - done;
+		sector_t dstart;
+		void *daddr;
+		ssize_t maplen;
+
+		/* Read the current extent */
+		em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, estart, elen, 0);
+		if (IS_ERR(em)) {
+			ret = PTR_ERR(em);
+			goto out;
+		}
+
+		/* Get a new extent */
+		ret = btrfs_get_extent_map_write(&em, NULL, inode, estart, elen);
+		if (ret < 0)
+			goto out;
+
+		dax_dev = fs_dax_get_by_bdev(em->bdev);
+		/* Calculate the start address of the destination extent */
+		dstart = (get_start_sect(em->bdev) << 9) + em->block_start;
+		maplen = dax_direct_access(dax_dev, PHYS_PFN(dstart),
+				PHYS_PFN(em->len), &daddr, NULL);
+
+		/* Copy front of extent page */
+		if (offset)
+			ret = copy_extent_page(first, daddr, estart);
+
[PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
From: Goldwyn Rodrigues

Signed-off-by: Goldwyn Rodrigues
---
 fs/btrfs/dax.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 88017f8799d1..6d68d39cc5da 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -198,10 +198,13 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
 	pfn_t pfn;
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
+	unsigned long vaddr = vmf->address;
 	struct inode *inode = mapping->host;
 	loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
 	void *entry = NULL;
 	vm_fault_t ret = 0;
+	struct extent_map *em;
+	struct dax_device *dax_dev;
 
 	if (pos > i_size_read(inode)) {
 		ret = VM_FAULT_SIGBUS;
@@ -214,21 +217,33 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
 		goto out;
 	}
 
-	if (!vmf->cow_page) {
+	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, PAGE_SIZE, 0);
+	if (em->block_start != EXTENT_MAP_HOLE)
+		dax_dev = fs_dax_get_by_bdev(em->bdev);
+
+	if (vmf->cow_page) {
+		sector_t sector;
+		if (em->block_start == EXTENT_MAP_HOLE) {
+			clear_user_highpage(vmf->cow_page, vaddr);
+			goto out;
+		}
+		sector = (get_start_sect(em->bdev) << 9) + (em->block_start + (pos - em->start));
+		sector >>= 9;
+		ret = copy_user_dax(em->bdev, dax_dev, sector, PAGE_SIZE, vmf->cow_page, vaddr);
+		goto out;
+	} else {
 		sector_t sector;
-		struct extent_map *em;
 
-		em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, PAGE_SIZE, 0);
 		if (em->block_start == EXTENT_MAP_HOLE) {
 			ret = dax_load_hole(&xas, mapping, &entry, vmf);
 			goto out;
 		}
 		sector = ((get_start_sect(em->bdev) << 9) + (em->block_start + (pos - em->start))) >> 9;
 		ret = dax_pfn(fs_dax_get_by_bdev(em->bdev), em->bdev, sector, PAGE_SIZE, &pfn);
 		if (ret)
 			goto out;
 		dax_insert_entry(&xas, mapping, vmf, entry, pfn, 0, false);
-		ret = vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+		ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
 	}
 out:
 	if (entry)
-- 
2.16.4
[PATCH 07/10] dax: export functions for use with btrfs
From: Goldwyn Rodrigues

These functions are required for btrfs dax support.

Signed-off-by: Goldwyn Rodrigues
---
 fs/dax.c            | 35 ---
 include/linux/dax.h | 16 ++
 2 files changed, 40 insertions(+), 11 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9bcce89ea18e..4578640af631 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -244,7 +244,7 @@ static void put_unlocked_entry(struct xa_state *xas, void *entry)
  * dropped the xa_lock, so we know the xa_state is stale and must be reset
  * before use.
  */
-static void dax_unlock_entry(struct xa_state *xas, void *entry)
+void dax_unlock_entry(struct xa_state *xas, void *entry)
 {
 	void *old;
 
@@ -256,6 +256,7 @@ static void dax_unlock_entry(struct xa_state *xas, void *entry)
 	BUG_ON(!dax_is_locked(old));
 	dax_wake_entry(xas, entry, false);
 }
+EXPORT_SYMBOL(dax_unlock_entry);
 
 /*
  * Return: The entry stored at this location before it was locked.
@@ -448,7 +449,7 @@ void dax_unlock_mapping_entry(struct page *page)
  * a VM_FAULT code, encoded as an xarray internal entry. The ERR_PTR values
  * overlap with xarray value entries.
  */
-static void *grab_mapping_entry(struct xa_state *xas,
+void *grab_mapping_entry(struct xa_state *xas,
 		struct address_space *mapping, unsigned long size_flag)
 {
 	unsigned long index = xas->xa_index;
@@ -531,6 +532,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
 	xas_unlock_irq(xas);
 	return xa_mk_internal(VM_FAULT_FALLBACK);
 }
+EXPORT_SYMBOL(grab_mapping_entry);
 
 /**
  * dax_layout_busy_page - find first pinned page in @mapping
@@ -654,7 +656,7 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 	return __dax_invalidate_entry(mapping, index, false);
 }
 
-static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
+int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
 		sector_t sector, size_t size, struct page *to,
 		unsigned long vaddr)
 {
@@ -679,6 +681,7 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
 	dax_read_unlock(id);
 	return 0;
 }
+EXPORT_SYMBOL(copy_user_dax);
 
 /*
  * By this point grab_mapping_entry() has ensured that we have a locked entry
 * already in the tree, we will skip the insertion and just dirty the PMD as
 * appropriate.
 */
-static void *dax_insert_entry(struct xa_state *xas,
+void *dax_insert_entry(struct xa_state *xas,
 		struct address_space *mapping, struct vm_fault *vmf,
 		void *entry, pfn_t pfn, unsigned long flags, bool dirty)
 {
@@ -736,6 +739,7 @@ static void *dax_insert_entry(struct xa_state *xas,
 	xas_unlock_irq(xas);
 	return entry;
 }
+EXPORT_SYMBOL(dax_insert_entry);
 
 static inline unsigned long pgoff_address(pgoff_t pgoff,
 		struct vm_area_struct *vma)
@@ -962,19 +966,18 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
 	return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
 }
 
-static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
-		pfn_t *pfnp)
+int dax_pfn(struct dax_device *dax_dev, struct block_device *bdev,
+		const sector_t sector, size_t size, pfn_t *pfnp)
 {
-	const sector_t sector = dax_iomap_sector(iomap, pos);
 	pgoff_t pgoff;
 	int id, rc;
 	long length;
 
-	rc = bdev_dax_pgoff(iomap->bdev, sector, size, &pgoff);
+	rc = bdev_dax_pgoff(bdev, sector, size, &pgoff);
 	if (rc)
 		return rc;
 	id = dax_read_lock();
-	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
+	length = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
 			NULL, pfnp);
 	if (length < 0) {
 		rc = length;
@@ -993,6 +996,14 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
 	dax_read_unlock(id);
 	return rc;
 }
+EXPORT_SYMBOL(dax_pfn);
+
+static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
+		pfn_t *pfnp)
+{
+	const sector_t sector = dax_iomap_sector(iomap, pos);
+	return dax_pfn(iomap->dax_dev, iomap->bdev, sector, size, pfnp);
+}
 
 /*
  * The user has performed a load from a hole in the file. Allocating a new
 * If this page is ever written to we will re-fault and change the mapping to
 * point to real DAX storage instead.
 */
-static vm_fault_t dax_load_hole(struct xa_state *xas,
+vm_fault_t dax_load_hole(struct xa_state *xas,
 		struct address_space *mapping, void **entry,
 		struct vm_fault *vmf)
 {
@@ -1017,6 +1028,7 @@ static vm_fault_t dax_load_hole(struct xa_state
[PATCH 03/10] btrfs: dax: read zeros from holes
From: Goldwyn Rodrigues

Signed-off-by: Goldwyn Rodrigues
---
 fs/btrfs/dax.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index d614bf73bf8e..5a297674adec 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -54,7 +54,12 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
 
 		BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING);
 
-		ret = em_dax_rw(inode, em, pos, len, to);
+		if (em->block_start == EXTENT_MAP_HOLE) {
+			u64 zero_len = min(em->len - (em->start - pos), len);
+			ret = iov_iter_zero(zero_len, to);
+		} else {
+			ret = em_dax_rw(inode, em, pos, len, to);
+		}
 		if (ret < 0)
 			goto out;
 		pos += ret;
-- 
2.16.4
[PATCH 00/10] btrfs: Support for DAX devices
This is a support for DAX in btrfs. I understand there have been
previous attempts at it. However, I wanted to make sure copy-on-write
(COW) works on dax as well.

Before I present this to the FS folks I wanted to run this through the
btrfs list. Even though I wish, I cannot get it correct the first time
around :/.. Here are some questions for which I need suggestions:

Questions:
1. I have been unable to do checksumming for DAX devices. While
checksumming can be done for reads and writes, it is a problem when mmap
is involved because the btrfs kernel module does not get back control
after an mmap() writes. Any ideas are appreciated, or we would have to
set nodatasum when dax is enabled.

2. Currently, a user can continue writing on "old" extents of an mmaped
file after a snapshot has been created. How can we enforce writes to be
directed to new extents after snapshots have been created? Do we keep a
list of all mmap()s, and re-mmap them after a snapshot?

Tested by creating a pmem device in RAM with the "memmap=2G!4G" kernel
command line parameter.

[PATCH 01/10] btrfs: create a mount option for dax
[PATCH 02/10] btrfs: basic dax read
[PATCH 03/10] btrfs: dax: read zeros from holes
[PATCH 04/10] Rename __endio_write_update_ordered() to
[PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
[PATCH 06/10] btrfs: dax write support
[PATCH 07/10] dax: export functions for use with btrfs
[PATCH 08/10] btrfs: dax add read mmap path
[PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
[PATCH 10/10] btrfs: dax mmap write

 fs/btrfs/Makefile   |   1
 fs/btrfs/ctree.h    |  17 ++
 fs/btrfs/dax.c      | 303 ++--
 fs/btrfs/file.c     |  29
 fs/btrfs/inode.c    |  54 +
 fs/btrfs/ioctl.c    |   5
 fs/btrfs/super.c    |  15 ++
 fs/dax.c            |  35 --
 include/linux/dax.h |  16 ++
 9 files changed, 430 insertions(+), 45 deletions(-)

-- 
Goldwyn
[PATCH 01/10] btrfs: create a mount option for dax
From: Goldwyn Rodrigues

Also, set the inode->i_flags to S_DAX

Signed-off-by: Goldwyn Rodrigues
---
 fs/btrfs/ctree.h |  1 +
 fs/btrfs/ioctl.c |  5 ++++-
 fs/btrfs/super.c | 15 +++++++++++++++
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 68f322f600a0..5cc470fa6a40 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1353,6 +1353,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
 #define BTRFS_MOUNT_NOLOGREPLAY	(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY	(1 << 28)
+#define BTRFS_MOUNT_DAX	(1 << 29)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 802a628e9f7d..e9146c157816 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -149,8 +149,11 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
 	if (binode->flags & BTRFS_INODE_DIRSYNC)
 		new_fl |= S_DIRSYNC;
 
+	if ((btrfs_test_opt(btrfs_sb(inode->i_sb), DAX)) && S_ISREG(inode->i_mode))
+		new_fl |= S_DAX;
+
 	set_mask_bits(&inode->i_flags,
-		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
+		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC | S_DAX,
		      new_fl);
 }
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 645fc81e2a94..035263b61cf5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -326,6 +326,7 @@ enum {
 	Opt_treelog, Opt_notreelog,
 	Opt_usebackuproot,
 	Opt_user_subvol_rm_allowed,
+	Opt_dax,
 
 	/* Deprecated options */
 	Opt_alloc_start,
@@ -393,6 +394,7 @@ static const match_table_t tokens = {
 	{Opt_notreelog, "notreelog"},
 	{Opt_usebackuproot, "usebackuproot"},
 	{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
+	{Opt_dax, "dax"},
 
 	/* Deprecated options */
 	{Opt_alloc_start, "alloc_start=%s"},
@@ -739,6 +741,17 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 	case Opt_user_subvol_rm_allowed:
 		btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
 		break;
+#ifdef CONFIG_FS_DAX
+	case Opt_dax:
+		if (btrfs_super_num_devices(info->super_copy) > 1) {
+			btrfs_info(info,
+				   "dax not supported for multi-device btrfs partition\n");
+			ret = -EOPNOTSUPP;
+			goto out;
+		}
+		btrfs_set_opt(info->mount_opt, DAX);
+		break;
+#endif
 	case Opt_enospc_debug:
 		btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
 		break;
@@ -1329,6 +1342,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",clear_cache");
 	if (btrfs_test_opt(info, USER_SUBVOL_RM_ALLOWED))
 		seq_puts(seq, ",user_subvol_rm_allowed");
+	if (btrfs_test_opt(info, DAX))
+		seq_puts(seq, ",dax");
 	if (btrfs_test_opt(info, ENOSPC_DEBUG))
 		seq_puts(seq, ",enospc_debug");
 	if (btrfs_test_opt(info, AUTO_DEFRAG))
-- 
2.16.4