Re: [LKP] [lkp-robot] [mm] 9092c71bb7: blogbench.write_score -12.3% regression
"Huang, Ying" writes: > Hi, Chris, > > Chris Mason writes: > >> On 19 Jun 2018, at 23:51, Huang, Ying wrote: > "Huang, Ying" writes: > >> Hi, Josef, >> >> Do you have time to take a look at the regression? >> >> kernel test robot writes: >> >>> Greeting, >>> >>> FYI, we noticed a -12.3% regression of blogbench.write_score and >>> a +9.6% improvement >>> of blogbench.read_score due to commit: >>> >>> >>> commit: 9092c71bb724dba2ecba849eae69e5c9d39bd3d2 ("mm: use >>> sc->priority for slab shrink targets") >>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git >>> master >>> >>> in testcase: blogbench >>> on test machine: 16 threads Intel(R) Xeon(R) CPU D-1541 @ >>> 2.10GHz with 8G memory >>> with following parameters: >>> >>> disk: 1SSD >>> fs: btrfs >>> cpufreq_governor: performance >>> >>> test-description: Blogbench is a portable filesystem benchmark >>> that tries to reproduce the load of a real-world busy file >>> server. >>> test-url: >> >> I'm surprised, this patch is a big win in production here at FB. I'll >> have to reproduce these results to better understand what is going on. >> My first guess is that since we have fewer inodes in slab, we're >> reading more inodes from disk in order to do the writes. >> >> But that should also make our read scores lower. > > Any update on this? Ping. Best Regards, Huang, Ying -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix data lose with snapshot when nospace
Filipe Manana 於 2018-08-02 01:55 寫到: On Wed, Aug 1, 2018 at 1:54 PM, Filipe Manana wrote: On Wed, Aug 1, 2018 at 11:20 AM, robbieko wrote: Filipe Manana 於 2018-07-31 19:33 寫到: On Tue, Jul 31, 2018 at 11:17 AM, robbieko wrote: Filipe Manana 於 2018-07-30 20:34 寫到: On Mon, Jul 30, 2018 at 12:28 PM, Filipe Manana wrote: On Mon, Jul 30, 2018 at 12:08 PM, Filipe Manana wrote: On Mon, Jul 30, 2018 at 11:21 AM, robbieko wrote: From: Robbie Ko Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting") modified the nocow writeback mechanism, if you create a snapshot, it will always switch to cow writeback. This will cause data loss when there is no space, because when the space is full, the write will not reserve any space, only check if it can be nocow write. This is a bit vague. You need to mention where space reservation does not happen (at the time of the write syscall) and why, and that the snapshot happens before flushing IO (running dealloc). Then when running dealloc we fallback to COW and fail. You also need to tell that although the write syscall did not return an error, the writeback will fail but a subsequent fsync on the file will return an error (ENOSPC) because the writeback set the error on the inode's mapping, so it's not completely a silent data loss, as for buffered writes there's no guarantee that if write syscall returns 0 the data will be persisted successfully (that can only be guaranteed if a subsequent fsync call returns 0). So fix this by first flush the nocow data, and then switch to the cow write. I'm also not seeing how what you've done is better then we have now using the root->will_be_snapshotted atomic, which is essentially used the same way as the new atomic you are adding, and forces the writeback code no nocow writes as well. So what you have done can be made much more simple by flushing delalloc before incrementing root->will_be_snapshotted instead of after incrementing it: https://friendpaste.com/2LY9eLAR9q0RoOtRK7VYmX There is no way to solve this problem in this modification. It minimizes it. It only gives better guarantees that nocow buffered writes that happened before calling the snapshot ioctl will not fall back to cow, not for the ones that happen while the call to the ioctl is happening. When writing and create snapshot at the same time, the write will not reserve space, and will not return to ENOSPC, because will_be_snapshotted is still 0. So when writeback flush data, there will still be problems with ENOSPC. Which is precisely what I proposed does without adding a new atomic and more changes. It flushes delalloc before incrementing root->will_be_snapshotted, so that previous buffered nocow writes will not fallback to cow mode (and require data space allocation). It only leaves a very tiny and very unlikely to hit (but not impossible) time window where nocow writes will fallback to cow mode - after calling start_delalloc_inodes() and before incrementing root->will_be_snapshotted a new buffered write can comes in and gets immediately flushed because someone called fsync() on the file or the VM decided to trigger writeback (due to memory pressure or some other reason). It is very easy to reproduce. Not a tiny time. Because the time of start_delalloc_inodes will be very long. When the dirty page is reduced, the new write can continue to be written, and there is no reserve space. I see now, thanks. And these dirty pages are not necessarily flushed by start_delalloc_inodes because start_delalloc_inodes is checked from the front to the back, so when the new write is written to the previous position, it will not flush again. So when start_delalloc_inodes is executed, will_be_snapshotted is added, and all remaining dirty pages will be forced to turn to cow, all data is loss. e.g. 16G RAM, dirty ratio 20%, 16*0.2 = 3.2 GB data loss. Writing the inode that has been flushed will have the same problem. The inode that has been flushed will not be flushed for the second time. So you have better suggestions ? Not efficient ones (introducing more locks). This one if fine, but please write a changelog that mentions all these details from your replies. The original one does not explain in detail the problem nor the solution. Some comments before incrementing both atomics and flushing delalloc at ioctl.c would also be good to have. And a test case for fstests too. So the fstests test case I converted my testing script into a proper patch for fstests: https://friendpaste.com/1oKSUHoZjp9OZtxDcK1Mey This test case is very good. Once you get a better title for the patch, I'll update the test's patch changelog and submit it to fstests. Fails with current btrfs and passes with your patch (and with earlier proposal too). Let me know if it's ok for you. Thanks. I will rewrite the changelog and add comments, then resend the patch v2. And replace the titled to "Btrfs: fix
Re: [PATCH 2/4] nfs: check for NULL vfsmount in nfs_getattr
On Tue, Jul 31, 2018 at 10:51:57PM +, Mark Fasheh wrote: > Hi Al, > > On Tue, Jul 31, 2018 at 11:16:05PM +0100, Al Viro wrote: > > On Tue, Jul 31, 2018 at 02:10:43PM -0700, Mark Fasheh wrote: > > > ->getattr from inside the kernel won't always have a vfsmount, check for > > > this. > > > > Really? Where would that happen? > > It happens in my first patch. FWIW, I'm not tied to that particular bit of > code, or even this (latest) approach. I would actually very much appreciate > your input into how we might fix the problem we have where we are sometimes > not exporting a correct ino/dev pair to user space. Which callers would those be? I don't see anything in your series that wouldn't have vfsmount at hand... > I have a good break-down of the problem and possible solutions here: > > https://www.spinics.net/lists/linux-fsdevel/msg128003.html btrfs pulling odd crap with device numbers is a problem, but this is far from the most obvious unpleasantness caused by that - e.g. userland code expecting that unequal st_dev for directory and subdirectory means that we have a mountpoint there. Or assuming it can compare that to st_dev of mountpoints (or, indeed, the values in /proc/self/mountinfo) to find which fs it's on... /proc/*/maps is unfortunate, but it does contain the pathnames, doesn't it? What _real_ problems are there - the ones where we don't have a saner solution? Note that /proc/locks is (a) FPOS interface and (b) unsuitable for your approach anyway. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/4] vfs: map unique ino/dev pairs for user space
Mark Fasheh: > The following patches fix these inconsistencies by introducing a VFS helper > function which calls the underlying filesystem ->getattr to get our real > inode number / device pair. The returned values can then be used at those > places in the kernel where we are incorrectly reporting our ino/dev pair. > We then update fs/proc/ and fs/locks.c to use this helper when writing to > /proc/PID/maps and /proc/locks respectively. I definitly agree that ino/dev pair should be a unique identity on the system. But I don't know why you are tryng to solve the problem in generic VFS layer instead of the problematic FS. Isn't it an unnecessary overhead for many FS? How about creating a new f_op member ->get_ino_dev(), ->show_identity() or something, and implement the new f_op in the problematic FS only? I hope it will be a lighter way to get the pair than generic getattr way. J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] locks: map correct ino/dev pairs when exporting to userspace
Hi Jeff, On Wed, Aug 01, 2018 at 07:03:54AM -0400, Jeff Layton wrote: > On Tue, 2018-07-31 at 14:10 -0700, Mark Fasheh wrote: > > /proc/locks does not always print the correct inode number/device pair. > > Update lock_get_status() to use vfs_map_unique_ino_dev() to get the real, > > unique values for userspace. > > > > Signed-off-by: Mark Fasheh > > --- > > fs/locks.c | 12 +--- > > 1 file changed, 9 insertions(+), 3 deletions(-) > > > > diff --git a/fs/locks.c b/fs/locks.c > > index db7b6917d9c5..3a012df87fd8 100644 > > --- a/fs/locks.c > > +++ b/fs/locks.c > > @@ -2621,6 +2621,7 @@ static void lock_get_status(struct seq_file *f, > > struct file_lock *fl, > > loff_t id, char *pfx) > > { > > struct inode *inode = NULL; > > + struct dentry *dentry; > > unsigned int fl_pid; > > struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info; > > > > @@ -2633,8 +2634,10 @@ static void lock_get_status(struct seq_file *f, > > struct file_lock *fl, > > if (fl_pid == 0) > > return; > > > > - if (fl->fl_file != NULL) > > + if (fl->fl_file != NULL) { > > inode = locks_inode(fl->fl_file); > > + dentry = file_dentry(fl->fl_file); > > + } > > > > seq_printf(f, "%lld:%s ", id, pfx); > > if (IS_POSIX(fl)) { > > @@ -2681,10 +2684,13 @@ static void lock_get_status(struct seq_file *f, > > struct file_lock *fl, > >: (fl->fl_type == F_WRLCK) ? "WRITE" : "READ "); > > } > > if (inode) { > > + __u64 ino; > > + dev_t dev; > > + > > + vfs_map_unique_ino_dev(dentry, &ino, &dev); > > This code is under a spinlock (blocked_locks_lock or ctx->flc_lock). I > don't think it'll be ok to call ->getattr while holding a spinlock. Hmm, indeed it does call under spinlock. I'll hve to rethink how to approach this in fs/locks.c Thanks, --Mark -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Expected Received UUID
Hi, Here are the kernel details (uname -a): Device A (Debian 8.10): Linux DeviceA 3.16.0-4-amd64 #1 SMP Debian 3.16.51-3 (2017-12-13) x86_64 GNU/Linux Device B (Debian 8.11): Linux DeviceB 3.16.0-4-amd64 #1 SMP Debian 3.16.51-3 (2017-12-13) x86_64 GNU/Linux Device C (Debian 8.9): Linux DeviceC 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u3 (2017-08-15) x86_64 GNU/Linux All devices run Debian Jessie but I updated btrfs-tools and btrfs-progs on devices B and C using the stretch repository to see the Received UUID details ( I had not come across python-btrfs earlier ) On Tue, Jul 31, 2018 at 4:37 PM, Hans van Kranenburg < hans.van.kranenb...@mendix.com> wrote: > Hi, > > Can you please report the kernel versions you are using on every system > (uname -a)? > > I would indeed expect the Received UUID value for C to have the same > uuid as the original UUID of A, so the b00e8ba1 one. > > The send part, generating the send stream is done by the running kernel. > The receive part is done by the userspace btrfs progs. The btrfs progs > receive code just sets the Received UUID to whatever it reads from the > send stream information. > > So, your sending kernel version is important here. > > When looking up the lines that send the UUID, in fs/btrfs/send.c, I can > see there was a problem like this in the past: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin > ux.git/commit/?id=b96b1db039ebc584d03a9933b279e0d3e704c528 > > I was introduced between linux 4.1 and 4.2 and solved in 2015 with that > commit, which ended up somewhere in 4.3 I think. > > From the btrfs-progs versions you list, I suspect this is a Debian > Jessie and 2x Debian Stretch, but the Debian 3.16 and 4.9 kernels would > not should not have this issue. > > On 07/31/2018 07:42 PM, Gaurav Sanghavi wrote: > > Hello everyone, > > > > While researching BTRFS for a project that involves backing up snapshots > > from device A to device B > > before consolidating backups from device B to device C, I noticed that > > received UUID on snapshot on > > device C is not the same as received UUID for the same snapshot on > > device B. Here are my steps: > > > > 1. Device A > > BTRFS version: v3.17 > > > > btrfs su sn -r /home/test/snapshot1 /home/test/snaps/snapshot1 > > btrfs su show /home/test/snaps/snapshot1 > > Name: snapshot1 > > uuid: b00e8ba1-5aaa-3641-9c4c-e168eee5c296 > > Parent uuid: cb570dec-e9fd-1f40-99d2-2070c8ee2466 > > Received UUID: --- > > Creation time: 2018-07-30 18:32:37 > > Flags: readonly > > > > 2. Send to Device B > > btrfs send /home/test/snaps/snapshot1 | ssh 'btrfs receive > > /home/backups/' > > > > After send completes, on Device B > > BTRFS version: v4.7.3 > > btrfs su show /home/backups/snapshot1 > > Name: snapshot1 > > UUID: 7c13d189-7fee-584e-ac90-e68cb0012f5c > > Parent UUID: a2314f7c-4b11-ed40-901f-f1acb5ebf802 > > Received UUID: b00e8ba1-5aaa-3641-9c4c-e168eee5c296 > > Creation time: 2018-07-30 18:42:37 -0700 > > Flags: readonly > > > > > > 3. Send to Device C > > btrfs send /home/backups/snapshot1 | ssh 'btrfs receive > > /home/backups2/' > > > > After send completes, on Device C > > BTRFS version: v4.7.3 > > btrfs su show /home/backups2/snapshot1 > > Name: snapshot1 > > UUID: 8a13aab5-8e44-2541-9082-bc583933b964 > > Parent UUID: 54e9b4ff-46dc-534e-b70f-69eb7bb21028 > > Received UUID: 7c13d189-7fee-584e-ac90-e68cb0012f5c > > Creation time: 2018-07-30 18:58:32 -0700 > > Flags: readonly > > > > 1. I have gone through some of the archived emails and have noticed > > people mentioning that > > if received UUID is set, btrfs send propogates this 'received UUID'. But > > in above case, > > it's different for the same snapshot on all three devices. Is this the > > expected behavior ? > > > > 2. We want to be able to start backing up from Device A to C, in case B > > goes down or needs > > to be replaced. If received UUID is expected to differ for the snapshot > > on device B and C, incremental > > backups will not work from A to C without setting received UUID. I have > > seen python-btrfs > > mentioned in a couple of emails; but have anyone of you used it in a > > production environment ? > > > > This is my first post to this email. Please let me know if I am missing > > any details. > > > > Thanks, > > Gaurav > > > > -- > Hans van Kranenburg >
Re: [PATCH] Btrfs: fix data lose with snapshot when nospace
On Wed, Aug 1, 2018 at 1:54 PM, Filipe Manana wrote: > On Wed, Aug 1, 2018 at 11:20 AM, robbieko wrote: >> Filipe Manana 於 2018-07-31 19:33 寫到: >> >>> On Tue, Jul 31, 2018 at 11:17 AM, robbieko wrote: Filipe Manana 於 2018-07-30 20:34 寫到: > On Mon, Jul 30, 2018 at 12:28 PM, Filipe Manana > wrote: >> >> >> On Mon, Jul 30, 2018 at 12:08 PM, Filipe Manana >> wrote: >>> >>> >>> On Mon, Jul 30, 2018 at 11:21 AM, robbieko >>> wrote: From: Robbie Ko Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting") modified the nocow writeback mechanism, if you create a snapshot, it will always switch to cow writeback. This will cause data loss when there is no space, because when the space is full, the write will not reserve any space, only check if it can be nocow write. >>> >>> >>> >>> This is a bit vague. >>> You need to mention where space reservation does not happen (at the >>> time of the write syscall) and why, >>> and that the snapshot happens before flushing IO (running dealloc). >>> Then when running dealloc we fallback >>> to COW and fail. >>> >>> You also need to tell that although the write syscall did not return >>> an error, the writeback will >>> fail but a subsequent fsync on the file will return an error (ENOSPC) >>> because the writeback set the error >>> on the inode's mapping, so it's not completely a silent data loss, as >>> for buffered writes there's no guarantee >>> that if write syscall returns 0 the data will be persisted >>> successfully (that can only be guaranteed if a subsequent >>> fsync call returns 0). >>> So fix this by first flush the nocow data, and then switch to the cow write. >> >> >> >> I'm also not seeing how what you've done is better then we have now >> using the root->will_be_snapshotted atomic, >> which is essentially used the same way as the new atomic you are >> adding, and forces the writeback code no nocow >> writes as well. > > > > So what you have done can be made much more simple by flushing > delalloc before incrementing root->will_be_snapshotted instead of > after incrementing it: > > https://friendpaste.com/2LY9eLAR9q0RoOtRK7VYmX There is no way to solve this problem in this modification. >>> >>> >>> It minimizes it. It only gives better guarantees that nocow buffered >>> writes that happened before calling the snapshot ioctl will not fall >>> back to cow, >>> not for the ones that happen while the call to the ioctl is happening. >>> When writing and create snapshot at the same time, the write will not reserve space, and will not return to ENOSPC, because will_be_snapshotted is still 0. So when writeback flush data, there will still be problems with ENOSPC. >>> >>> >>> Which is precisely what I proposed does without adding a new atomic >>> and more changes. >>> It flushes delalloc before incrementing root->will_be_snapshotted, so >>> that previous buffered nocow writes will not fallback to cow mode (and >>> require data space allocation). >>> >>> It only leaves a very tiny and very unlikely to hit (but not >>> impossible) time window where nocow writes will fallback >>> to cow mode - after calling start_delalloc_inodes() and before >>> incrementing root->will_be_snapshotted a new buffered write can comes >>> in and gets immediately flushed >>> because someone called fsync() on the file or the VM decided to >>> trigger writeback (due to memory pressure or some other reason). >>> >> >> It is very easy to reproduce. Not a tiny time. >> Because the time of start_delalloc_inodes will be very long. >> When the dirty page is reduced, the new write can continue >> to be written, and there is no reserve space. > > I see now, thanks. >> >> And these dirty pages are not necessarily flushed by >> start_delalloc_inodes because start_delalloc_inodes is >> checked from the front to the back, so when the new >> write is written to the previous position, it will not flush again. >> >> So when start_delalloc_inodes is executed, will_be_snapshotted is added, >> and all remaining dirty pages will be forced to turn to cow, all data is >> loss. >> e.g. 16G RAM, dirty ratio 20%, 16*0.2 = 3.2 GB data loss. >> >> Writing the inode that has been flushed will have the same problem. >> The inode that has been flushed will not be flushed for the second time. >> >> So you have better suggestions ? > > Not efficient ones (introducing more locks). > This one if fine, but please write a changelog that mentions all these > details from your replies. > The original one does not explain in detail the problem nor the solution. > Some comments before incrementing both atomics and flushing delalloc > at ioctl.c would also be good to ha
Re: [PATCH 4/4] locks: map correct ino/dev pairs when exporting to userspace
On Wed, Aug 01, 2018 at 08:37:31AM +0300, Amir Goldstein wrote: > > if (inode) { > > + __u64 ino; > > + dev_t dev; > > + > > + vfs_map_unique_ino_dev(dentry, &ino, &dev); > > /* userspace relies on this representation of dev_t */ > > seq_printf(f, "%d %02x:%02x:%ld ", fl_pid, > > - MAJOR(inode->i_sb->s_dev), > > - MINOR(inode->i_sb->s_dev), inode->i_ino); > > + MAJOR(dev), MINOR(dev), inode->i_ino); > > Don't you mean ,ino); ? Indeed I do, thanks for catching that one Amir. --Mark -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] vfs: introduce function to map unique ino/dev pairs
Hi Amir, On Wed, Aug 01, 2018 at 08:41:14AM +0300, Amir Goldstein wrote: > > +void vfs_map_unique_ino_dev(struct dentry *dentry, u64 *ino, dev_t *dev) > > I find this function name a bit more than function can guaranty. > It's just a fancy wrapper around ->getattr() > How about vfs_get_ino_dev() ? Yeah I agree with that. An early version actually had the name vfs_get_ino_dev(), I don't mind going back to it. Thanks for the review! --Mark -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 4/4] btrfs: add helper btrfs_num_devices() to deduce num_devices
On Thu, Jul 26, 2018 at 02:53:34PM +0800, Anand Jain wrote: > When the replace is running the fs_devices::num_devices also includes > the replace device, however in some operations like device delete and > balance it needs the actual num_devices without the repalce devices, so > now the function btrfs_num_devices() just provides that. The part how concurrent balance and dev-replace can be active at the same time is missing. As this is not obvious, it's desired to have that in the changelog. If there are questions and comments in past patchset revisions, that's a good material for changelogs and when you send an update you can enhance the changelogs. I do simple fixups or additions but when you send a new revision it the right time to save time on both sides. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS and databases
On 2018-07-31 11:45 PM, MegaBrutal wrote: > I know that with nodatacow, I take away most of the benefits of BTRFS > (those are actually hurting database performance – the exact CoW > nature that is elsewhere a blessing, with databases it's a drawback). > But are there any advantages of still sticking to BTRFS for a database > albeit CoW is disabled, or should I just return to the old and > reliable ext4 for those applications? > Be very careful about nodatacow and btrfs 'raid'. BTRFS has no data synching mechanism for raid, so if your mirrors end up different somehow, your Array is going to be inconsistent. <>
Re: [PATCH v2 2/4] btrfs: fix race between free_stale_devices and close_fs_devices
On Thu, Jul 26, 2018 at 02:53:32PM +0800, Anand Jain wrote: > From: Anand Jain > > %fs_devices can be free-ed by btrfs_free_stale_devices() when the > close_fs_devices() drops fs_devices::opened to zero, but close_fs_devices > tries to access the %fs_devices again without the device_list_mutex. > > Fix this by bringing the %fs_devices access with in the device_list_mutex. AFAICS this cannot happen anymore because the two calls are serialized by the uuid_mutex. But this was not the case when syzbot reported the problem where your patch would apply. The parallell access to opened and device list cannot happen when: * btrfs_scan_one_device that wants to call btrfs_free_stale_devices * btrfs_close_devices calls close_fs_devices Fixed by the series: btrfs: lift uuid_mutex to callers of btrfs_scan_one_device btrfs: lift uuid_mutex to callers of btrfs_open_devices btrfs: lift uuid_mutex to callers of btrfs_parse_early_options btrfs: reorder initialization before the mount locks uuid_mutex btrfs: fix mount and ioctl device scan ioctl race If there's a race I don't see, please describe in more detail. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs: locking: Allow btrfs_tree_lock() to return error to avoid deadlock
On 2018年08月01日 21:27, David Sterba wrote: > On Tue, Jul 31, 2018 at 02:52:23PM +0800, Qu Wenruo wrote: >> [BUG] >> For certains crafted image, whose csum root leaf has missing backref, if >> we try to trigger write with data csum, it could cause deadlock with the >> following kernel WARN_ON(): >> -- >> WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400 >> CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8 >> Workqueue: btrfs-endio-write btrfs_endio_write_helper >> RIP: 0010:btrfs_tree_lock+0x3e2/0x400 >> Call Trace: >> btrfs_alloc_tree_block+0x39f/0x770 >> __btrfs_cow_block+0x285/0x9e0 >> btrfs_cow_block+0x191/0x2e0 >> btrfs_search_slot+0x492/0x1160 >> btrfs_lookup_csum+0xec/0x280 >> btrfs_csum_file_blocks+0x2be/0xa60 >> add_pending_csums+0xaf/0xf0 >> btrfs_finish_ordered_io+0x74b/0xc90 >> finish_ordered_fn+0x15/0x20 >> normal_work_helper+0xf6/0x500 >> btrfs_endio_write_helper+0x12/0x20 >> process_one_work+0x302/0x770 >> worker_thread+0x81/0x6d0 >> kthread+0x180/0x1d0 >> ret_from_fork+0x35/0x40 >> ---[ end trace 2e85051acb5f6dc1 ]--- >> -- >> >> [CAUSE] >> That crafted image has missing backref for csum tree root leaf. >> And when we try to allocate new tree block, since there is no >> EXTENT/METADATA_ITEM for csum tree root, btrfs consider it's free slot >> and use it. > > I don't understand what exactly is the problem, can you please rephrase > that? Ie. what exactly is missing from where. If the problem is on the > higher level, regarding the csum tree, this could be checked much > earlier. I don't really like to see each and every callsite of tree > locks to require the error handling, so I'm exploring the possibilities > if this is necessary at all. This is a little complex. I'll try to explain this using some asciiart: Normal image | This fuzzed image --+ BG 29360128 | BG 29360128 One empty slot | One empty slot 29364224: backref to UUID tree| 29364224: backref to UUID tree Two empty slots | Two empty slots 29376512: backref to CSUM tree| One empty slot (bad type) 29380608: backref to D_RELOC tree | 29380608: backref to D_RELOC tree ... | ... Since bytenr 29376512 has no METADATA/EXTENT_ITEM, when btrfs try to alloc tree block, it's an valid slot for btrfs. And for finish_ordered_write, when we need to insert csum, we try to CoW csum tree root. Then extent allocator gives a new extent bytenr 29376512 length 4096. However bytenr 29376512 is our current csum root, and already locked during CoW. So in btrfs_alloc_tree_block() which calls btrfs_tree_lock() to lock the newly allocated tree block, we are trying to lock the tree block we have already locked, thus triggering the WARN_ON() and deadlock. In this particular fuzzed image, we can do key type check to detect early corruption, but it has the following problem: 1) Won't really solve the problem I could easily craft a image with backref for csum tree missing, and everything else is OK. It will still cause the problem. 2) May break later RO_compat features If later RO_compat feature introduces new key type in extent tree, in theory it should be able to be mounted RO, but will be rejected by the type check. So we have to go the way used in the patch. Another solution would be do backref check each time we try to read out a tree block, but that's too costly and may screw up locking. Thanks, Qu > >> However we have already locked csum tree root leaf by ourselves, thus we >> will ourselves to release read/write lock, and causing deadlock. >> >> [FIX] >> This patch will allow btrfs_tree_lock() to return error number, so >> most callers can exit gracefully. >> (Some caller doesn't really expect btrfs_tree_lock() to return error, >> and in that case, we use BUG_ON()) >> >> Since such problem can only happen in crafted image, we will still >> trigger kernel warning, but with a little more meaningful warning >> message. >> >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405 >> Reported-by: Xu Wen >> Signed-off-by: Qu Wenruo >> Reviewed-by: Su Yue >> --- >> changelog: >> v2: >> Fix pid_t output format from %llu to %d. Pointed by Su. >> Fix missing return 0 for btrfs_lock_tree(). >> Update reviewed-by tags. >> --- >> fs/btrfs/ctree.c | 57 +++--- >> fs/btrfs/extent-tree.c | 28 +++ >> fs/btrfs/extent_io.c | 8 -- >> fs/btrfs/free-space-tree.c | 4 ++- >> fs/btrfs/locking.c | 13 +++-- >> fs/btrfs/locking.h | 2 +- >> fs/btrfs/qgroup.c | 4 ++- >> fs/btrfs/relocation.c | 13 +++-- >> fs/btrfs/tree-log.c| 14 -- >> 9 files changed, 115 insertions(+), 28 deletions(-) >> >> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c >> index 4bc326df47
Re: [PATCH v2] btrfs: locking: Allow btrfs_tree_lock() to return error to avoid deadlock
On Tue, Jul 31, 2018 at 02:52:23PM +0800, Qu Wenruo wrote: > [BUG] > For certains crafted image, whose csum root leaf has missing backref, if > we try to trigger write with data csum, it could cause deadlock with the > following kernel WARN_ON(): > -- > WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400 > CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8 > Workqueue: btrfs-endio-write btrfs_endio_write_helper > RIP: 0010:btrfs_tree_lock+0x3e2/0x400 > Call Trace: > btrfs_alloc_tree_block+0x39f/0x770 > __btrfs_cow_block+0x285/0x9e0 > btrfs_cow_block+0x191/0x2e0 > btrfs_search_slot+0x492/0x1160 > btrfs_lookup_csum+0xec/0x280 > btrfs_csum_file_blocks+0x2be/0xa60 > add_pending_csums+0xaf/0xf0 > btrfs_finish_ordered_io+0x74b/0xc90 > finish_ordered_fn+0x15/0x20 > normal_work_helper+0xf6/0x500 > btrfs_endio_write_helper+0x12/0x20 > process_one_work+0x302/0x770 > worker_thread+0x81/0x6d0 > kthread+0x180/0x1d0 > ret_from_fork+0x35/0x40 > ---[ end trace 2e85051acb5f6dc1 ]--- > -- > > [CAUSE] > That crafted image has missing backref for csum tree root leaf. > And when we try to allocate new tree block, since there is no > EXTENT/METADATA_ITEM for csum tree root, btrfs consider it's free slot > and use it. I don't understand what exactly is the problem, can you please rephrase that? Ie. what exactly is missing from where. If the problem is on the higher level, regarding the csum tree, this could be checked much earlier. I don't really like to see each and every callsite of tree locks to require the error handling, so I'm exploring the possibilities if this is necessary at all. > However we have already locked csum tree root leaf by ourselves, thus we > will ourselves to release read/write lock, and causing deadlock. > > [FIX] > This patch will allow btrfs_tree_lock() to return error number, so > most callers can exit gracefully. > (Some caller doesn't really expect btrfs_tree_lock() to return error, > and in that case, we use BUG_ON()) > > Since such problem can only happen in crafted image, we will still > trigger kernel warning, but with a little more meaningful warning > message. > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405 > Reported-by: Xu Wen > Signed-off-by: Qu Wenruo > Reviewed-by: Su Yue > --- > changelog: > v2: > Fix pid_t output format from %llu to %d. Pointed by Su. > Fix missing return 0 for btrfs_lock_tree(). > Update reviewed-by tags. > --- > fs/btrfs/ctree.c | 57 +++--- > fs/btrfs/extent-tree.c | 28 +++ > fs/btrfs/extent_io.c | 8 -- > fs/btrfs/free-space-tree.c | 4 ++- > fs/btrfs/locking.c | 13 +++-- > fs/btrfs/locking.h | 2 +- > fs/btrfs/qgroup.c | 4 ++- > fs/btrfs/relocation.c | 13 +++-- > fs/btrfs/tree-log.c| 14 -- > 9 files changed, 115 insertions(+), 28 deletions(-) > > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c > index 4bc326df472e..6e695f4cd931 100644 > --- a/fs/btrfs/ctree.c > +++ b/fs/btrfs/ctree.c > @@ -161,10 +161,12 @@ struct extent_buffer *btrfs_root_node(struct btrfs_root > *root) > struct extent_buffer *btrfs_lock_root_node(struct btrfs_root *root) > { > struct extent_buffer *eb; > + int ret; > > while (1) { > eb = btrfs_root_node(root); > - btrfs_tree_lock(eb); > + ret = btrfs_tree_lock(eb); > + BUG_ON(ret < 0); > if (eb == root->node) > break; > btrfs_tree_unlock(eb); > @@ -1584,6 +1586,7 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans, > for (i = start_slot; i <= end_slot; i++) { > struct btrfs_key first_key; > int close = 1; > + int ret; > > btrfs_node_key(parent, &disk_key, i); > if (!progress_passed && comp_keys(&disk_key, progress) < 0) > @@ -1637,7 +1640,11 @@ int btrfs_realloc_node(struct btrfs_trans_handle > *trans, > if (search_start == 0) > search_start = last_block; > > - btrfs_tree_lock(cur); > + ret = btrfs_tree_lock(cur); > + if (ret < 0) { > + free_extent_buffer(cur); > + return ret; > + } > btrfs_set_lock_blocking(cur); > err = __btrfs_cow_block(trans, root, cur, parent, i, > &cur, search_start, > @@ -1853,7 +1860,11 @@ static noinline int balance_level(struct > btrfs_trans_handle *trans, > goto enospc; > } > > - btrfs_tree_lock(child); > + ret = btrfs_tree_lock(child); > + if (ret < 0) { > + free_extent_buffer(child); > + goto enospc; > + } > btrfs_set_lock_blocking(child); > ret = btrfs_co
Re: [PATCH] btrfs: replace: Reset on-disk dev stats value after replace
On Tue, Jul 31, 2018 at 04:20:21PM +0900, Misono Tomohiro wrote: > on-disk devs stats value is updated in btrfs_run_dev_stats(), > which is called during commit transaction, if device->dev_stats_ccnt > is not zero. > > Since current replace operation does not touch dev_stats_ccnt, > on-disk dev stats value is not updated. Therefore "btrfs device stats" > may return old device's value after umount/mount > (Example: See "btrfs ins dump-t -t DEV $DEV" after btrfs/100 finish). In such cases it's good to have a snippet of the output, so it's possible to match the output and check if we see the same thing. Otherwise looks ok, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH resend 1/2] btrfs: allow defrag on a file opened ro that has rw permissions
On Tue, Jul 24, 2018 at 10:35:28PM +0200, Adam Borowski wrote: > On Mon, Jul 23, 2018 at 05:26:24PM +0200, David Sterba wrote: > > On Wed, Jul 18, 2018 at 12:08:59AM +0200, Adam Borowski wrote: > (Combined with as-folded) > > | | btrfs: allow defrag on a file opened read-only that has rw permissions > | | > > > Requiring a rw descriptor conflicts both ways with exec, returning ETXTBSY > > > whenever you try to defrag a program that's currently being run, or > > > causing intermittent exec failures on a live system being defragged. > > > > > > As defrag doesn't change the file's contents in any way, there's no reason > > > to consider it a rw operation. Thus, let's check only whether the file > > > could have been opened rw. Such access control is still needed as > > > currently defrag can use extra disk space, and might trigger bugs. > <- > | | We give EINVAL when the request is invalid; here it's ok but merely the > | | user has insufficient privileges. Thus, this return value reflects the > | | error better -- as discussed in the identical case for dedupe. > | | > | | According to codesearch.debian.net, no userspace program distinguishes > | | these values beyond strerror(). > | | > | | Signed-off-by: Adam Borowski > | | Reviewed-by: David Sterba > | | [ fold the EPERM patch from Adam ] > | | Signed-off-by: David Sterba > > [...] > > So, I'll add the patch to 4.19 queue. It's small and isolated change so > > a revert would be easy in case we find something bad. The 2nd patch > > should be IMHO part of this change as it's logical to return the error > > code in the patch that modifies the user visible behaviour. > > A nitpick: the new commit message has a dangling pointer "this" to the title > of the commit that was squashed. It was: > > | btrfs: defrag: return EPERM not EINVAL when only permissions fail > > It'd be nice if it could be inserted in some form in the place I marked with > an arrow. Right. I've replaced 'this' with 'EPERM', thanks for catching it. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: extent-tree.c: Remove unused __btrfs_free_block_rsv()
On Thu, Jul 26, 2018 at 11:40:54AM +0900, Misono Tomohiro wrote: > There is no user of this function. > > This is forgotten to get removed in commit a575ceeb1338 > ("Btrfs: get rid of unused orphan infrastructure"). > > Signed-off-by: Misono Tomohiro Added to misc-next, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: backref.c: Use ERR_CAST() to return error code
On Thu, Jul 26, 2018 at 10:22:58AM +0900, Misono Tomohiro wrote: > Use ERR_CAST() instead of void * to make meaning clear. > > Signed-off-by: Misono Tomohiro Added to misc-next, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix data lose with snapshot when nospace
On Wed, Aug 1, 2018 at 11:20 AM, robbieko wrote: > Filipe Manana 於 2018-07-31 19:33 寫到: > >> On Tue, Jul 31, 2018 at 11:17 AM, robbieko wrote: >>> >>> Filipe Manana 於 2018-07-30 20:34 寫到: >>> On Mon, Jul 30, 2018 at 12:28 PM, Filipe Manana wrote: > > > On Mon, Jul 30, 2018 at 12:08 PM, Filipe Manana > wrote: >> >> >> On Mon, Jul 30, 2018 at 11:21 AM, robbieko >> wrote: >>> >>> >>> From: Robbie Ko >>> >>> Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting") >>> modified the nocow writeback mechanism, if you create a snapshot, >>> it will always switch to cow writeback. >>> >>> This will cause data loss when there is no space, because >>> when the space is full, the write will not reserve any space, only >>> check if it can be nocow write. >> >> >> >> This is a bit vague. >> You need to mention where space reservation does not happen (at the >> time of the write syscall) and why, >> and that the snapshot happens before flushing IO (running dealloc). >> Then when running dealloc we fallback >> to COW and fail. >> >> You also need to tell that although the write syscall did not return >> an error, the writeback will >> fail but a subsequent fsync on the file will return an error (ENOSPC) >> because the writeback set the error >> on the inode's mapping, so it's not completely a silent data loss, as >> for buffered writes there's no guarantee >> that if write syscall returns 0 the data will be persisted >> successfully (that can only be guaranteed if a subsequent >> fsync call returns 0). >> >>> >>> So fix this by first flush the nocow data, and then switch to the >>> cow write. > > > > I'm also not seeing how what you've done is better then we have now > using the root->will_be_snapshotted atomic, > which is essentially used the same way as the new atomic you are > adding, and forces the writeback code no nocow > writes as well. So what you have done can be made much more simple by flushing delalloc before incrementing root->will_be_snapshotted instead of after incrementing it: https://friendpaste.com/2LY9eLAR9q0RoOtRK7VYmX >>> >>> >>> There is no way to solve this problem in this modification. >> >> >> It minimizes it. It only gives better guarantees that nocow buffered >> writes that happened before calling the snapshot ioctl will not fall >> back to cow, >> not for the ones that happen while the call to the ioctl is happening. >> >>> >>> When writing and create snapshot at the same time, the write will not >>> reserve space, >>> and will not return to ENOSPC, because will_be_snapshotted is still 0. >>> So when writeback flush data, there will still be problems with ENOSPC. >> >> >> Which is precisely what I proposed does without adding a new atomic >> and more changes. >> It flushes delalloc before incrementing root->will_be_snapshotted, so >> that previous buffered nocow writes will not fallback to cow mode (and >> require data space allocation). >> >> It only leaves a very tiny and very unlikely to hit (but not >> impossible) time window where nocow writes will fallback >> to cow mode - after calling start_delalloc_inodes() and before >> incrementing root->will_be_snapshotted a new buffered write can comes >> in and gets immediately flushed >> because someone called fsync() on the file or the VM decided to >> trigger writeback (due to memory pressure or some other reason). >> > > It is very easy to reproduce. Not a tiny time. > Because the time of start_delalloc_inodes will be very long. > When the dirty page is reduced, the new write can continue > to be written, and there is no reserve space. I see now, thanks. > > And these dirty pages are not necessarily flushed by > start_delalloc_inodes because start_delalloc_inodes is > checked from the front to the back, so when the new > write is written to the previous position, it will not flush again. > > So when start_delalloc_inodes is executed, will_be_snapshotted is added, > and all remaining dirty pages will be forced to turn to cow, all data is > loss. > e.g. 16G RAM, dirty ratio 20%, 16*0.2 = 3.2 GB data loss. > > Writing the inode that has been flushed will have the same problem. > The inode that has been flushed will not be flushed for the second time. > > So you have better suggestions ? Not efficient ones (introducing more locks). This one if fine, but please write a changelog that mentions all these details from your replies. The original one does not explain in detail the problem nor the solution. Some comments before incrementing both atomics and flushing delalloc at ioctl.c would also be good to have. And a test case for fstests too. Thanks. > > >>> >>> The behavior I changed was to increase will_be_snapshotted first, >>> so the following write must have a reserve space, >>> other
Re: [PATCH 0/7] Some unused parameter cleanup
On Wed, Aug 01, 2018 at 11:32:24AM +0800, Lu Fengqi wrote: > PATCH 1-2 remove all unused fs_info parameter from delayed-inode.c > PATCH 3-5 remove all unused fs_info parameter from root-tree.c > PATCH 6-7 some cleanup for btrfs_unlink_subvol Thanks, added to misc-next. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] btrfs: Handle owner mismatch gracefully when walking up tree
On 1.08.2018 15:19, Qu Wenruo wrote: > > > On 2018年08月01日 20:12, Nikolay Borisov wrote: >> >> >> On 1.08.2018 14:13, Qu Wenruo wrote: >>> >>> >>> On 2018年08月01日 18:08, Nikolay Borisov wrote: On 1.08.2018 11:08, Qu Wenruo wrote: > [BUG] > When mounting certain crafted image, btrfs will trigger kernel BUG_ON() > when try to recover balance: > -- > [ cut here ] > kernel BUG at fs/btrfs/extent-tree.c:8956! > invalid opcode: [#1] PREEMPT SMP NOPTI > CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10 > RIP: 0010:walk_up_proc+0x336/0x480 [btrfs] > RSP: 0018:b53540c9b890 EFLAGS: 00010202 > Call Trace: > walk_up_tree+0x172/0x1f0 [btrfs] > btrfs_drop_snapshot+0x3a4/0x830 [btrfs] > merge_reloc_roots+0xe1/0x1d0 [btrfs] > btrfs_recover_relocation+0x3ea/0x420 [btrfs] > open_ctree+0x1af3/0x1dd0 [btrfs] > btrfs_mount_root+0x66b/0x740 [btrfs] > mount_fs+0x3b/0x16a > vfs_kern_mount.part.9+0x54/0x140 > btrfs_mount+0x16d/0x890 [btrfs] > mount_fs+0x3b/0x16a > vfs_kern_mount.part.9+0x54/0x140 > do_mount+0x1fd/0xda0 > ksys_mount+0xba/0xd0 > __x64_sys_mount+0x21/0x30 > do_syscall_64+0x60/0x210 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > ---[ end trace d4344e4deee03435 ]--- > -- > > [CAUSE] > Another extent tree corruption. > > In this particular case, tree reloc root's owner is > DATA_RELOC_TREE (should be TREE_RELOC_TREE), thus its backref is > corrupted and we failed the owner check in walk_up_tree(). > > [FIX] > It's pretty hard to take care of every extent tree corruption, but at > least we can remove such BUG_ON() and exit more gracefully. > > And since in this particular image, DATA_RELOC_TREE and TREE_RELOC_TREE > shares the same root (which is obviously invalid), we needs to make > __del_reloc_root() more robust to detect such invalid share to avoid > possible NULL dereference as root->node can be NULL in this case. > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411 > Reported-by: Xu Wen > Signed-off-by: Qu Wenruo > --- > As always, the patch is also pushed to my github repo, along with other > fuzzed images related fixes: > https://github.com/adam900710/linux/tree/tree_checker_enhance > (BTW, is it correct to indicate a branch like above?) > --- > fs/btrfs/extent-tree.c | 27 +++ > fs/btrfs/relocation.c | 2 +- > 2 files changed, 20 insertions(+), 9 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index da615ebc072e..5f4ca61348b5 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -8949,17 +8949,26 @@ static noinline int walk_up_proc(struct > btrfs_trans_handle *trans, > } > > if (eb == root->node) { > - if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) > + if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) { > parent = eb->start; > - else > - BUG_ON(root->root_key.objectid != > -btrfs_header_owner(eb)); > + } else if (root->root_key.objectid != btrfs_header_owner(eb)) { > + btrfs_err_rl(fs_info, > + "unexpected tree owner, have %llu expect %llu", > + btrfs_header_owner(eb), > + root->root_key.objectid); > + return -EINVAL; EINVAL or ECLEANUP? >>> >>> Yep, also my concern here. >>> >>> I have no bias here, and both makes its sense here. >>> >>> EUCLEAN means it's something unexpected, but normally it's used in >>> static check, no sure if it suits for runtime check. >> >> My thinking goes if something is an on-disk error (and fuzzed images >> fall in that category) then we should return EUCLEAN. If the owner can >> be mismatched only as a result of erroneous data on-disk which is then >> read and subsequently this code triggers then it's something induced due >> to an on-disk error. > > Makes sense. > > Does it also mean later BUG_ON() convert would also use EUCLEAN as most > BUG_ON() is either some real bug or corrupted/fuzzed images? If you refer to the next hunk the patch then I'd say yes. > > Thanks, > Qu > >> >>> >>> Although EINVAL looks more suitable for runtime error, it is not a >>> perfect errno either, as it's not something invalid from user, but the >>> fs has something unexpected. >>> >>> I'm all ears on this errno issue. >>> >>> Thanks, >>> Qu >>> > + } > } else { > - if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF) > + if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF) { > parent = path->nodes[level + 1]->s
Re: BTRFS and databases
On 2018-07-31 23:45, MegaBrutal wrote: Hi all, I know it's a decade-old question, but I'd like to hear your thoughts of today. By now, I became a heavy BTRFS user. Almost everywhere I use BTRFS, except in situations when it is obvious there is no benefit (e.g. /var/log, /boot). At home, all my desktop, laptop and server computers are mainly running on BTRFS with only a few file systems on ext4. I even installed BTRFS in corporate productive systems (in those cases, the systems were mainly on ext4; but there were some specific file systems those exploited BTRFS features). But there is still one question that I can't get over: if you store a database (e.g. MySQL), would you prefer having a BTRFS volume mounted with nodatacow, or would you just simply use ext4? I know that with nodatacow, I take away most of the benefits of BTRFS (those are actually hurting database performance – the exact CoW nature that is elsewhere a blessing, with databases it's a drawback). But are there any advantages of still sticking to BTRFS for a database albeit CoW is disabled, or should I just return to the old and reliable ext4 for those applications? You still have snapshotting, which can be useful if you for some reason can't just dump all the tables to SQL for backups (but seriously, that's _really_ what you should be doing to back up your database, it decouples it from whatever back-end storage you're using). You also still have the guaranteed metadata consistency (nodatacow disables COW for data chunks only, COW still happens for metadata chunks), but you can get (close) to that with the newest versions of XFS too. If you want to use BTRFS, I'd suggest playing around with the different on-disk storage formats offered by your RDBMS (MySQL in this case). Some of the offer measurably better performance on BTRFS than others, but it's at least partially dependent on how your software uses that database. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] btrfs: Handle owner mismatch gracefully when walking up tree
On 2018年08月01日 20:12, Nikolay Borisov wrote: > > > On 1.08.2018 14:13, Qu Wenruo wrote: >> >> >> On 2018年08月01日 18:08, Nikolay Borisov wrote: >>> >>> >>> On 1.08.2018 11:08, Qu Wenruo wrote: [BUG] When mounting certain crafted image, btrfs will trigger kernel BUG_ON() when try to recover balance: -- [ cut here ] kernel BUG at fs/btrfs/extent-tree.c:8956! invalid opcode: [#1] PREEMPT SMP NOPTI CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10 RIP: 0010:walk_up_proc+0x336/0x480 [btrfs] RSP: 0018:b53540c9b890 EFLAGS: 00010202 Call Trace: walk_up_tree+0x172/0x1f0 [btrfs] btrfs_drop_snapshot+0x3a4/0x830 [btrfs] merge_reloc_roots+0xe1/0x1d0 [btrfs] btrfs_recover_relocation+0x3ea/0x420 [btrfs] open_ctree+0x1af3/0x1dd0 [btrfs] btrfs_mount_root+0x66b/0x740 [btrfs] mount_fs+0x3b/0x16a vfs_kern_mount.part.9+0x54/0x140 btrfs_mount+0x16d/0x890 [btrfs] mount_fs+0x3b/0x16a vfs_kern_mount.part.9+0x54/0x140 do_mount+0x1fd/0xda0 ksys_mount+0xba/0xd0 __x64_sys_mount+0x21/0x30 do_syscall_64+0x60/0x210 entry_SYSCALL_64_after_hwframe+0x49/0xbe ---[ end trace d4344e4deee03435 ]--- -- [CAUSE] Another extent tree corruption. In this particular case, tree reloc root's owner is DATA_RELOC_TREE (should be TREE_RELOC_TREE), thus its backref is corrupted and we failed the owner check in walk_up_tree(). [FIX] It's pretty hard to take care of every extent tree corruption, but at least we can remove such BUG_ON() and exit more gracefully. And since in this particular image, DATA_RELOC_TREE and TREE_RELOC_TREE shares the same root (which is obviously invalid), we needs to make __del_reloc_root() more robust to detect such invalid share to avoid possible NULL dereference as root->node can be NULL in this case. Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411 Reported-by: Xu Wen Signed-off-by: Qu Wenruo --- As always, the patch is also pushed to my github repo, along with other fuzzed images related fixes: https://github.com/adam900710/linux/tree/tree_checker_enhance (BTW, is it correct to indicate a branch like above?) --- fs/btrfs/extent-tree.c | 27 +++ fs/btrfs/relocation.c | 2 +- 2 files changed, 20 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index da615ebc072e..5f4ca61348b5 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -8949,17 +8949,26 @@ static noinline int walk_up_proc(struct btrfs_trans_handle *trans, } if (eb == root->node) { - if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) + if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) { parent = eb->start; - else - BUG_ON(root->root_key.objectid != - btrfs_header_owner(eb)); + } else if (root->root_key.objectid != btrfs_header_owner(eb)) { + btrfs_err_rl(fs_info, + "unexpected tree owner, have %llu expect %llu", + btrfs_header_owner(eb), + root->root_key.objectid); + return -EINVAL; >>> >>> EINVAL or ECLEANUP? >> >> Yep, also my concern here. >> >> I have no bias here, and both makes its sense here. >> >> EUCLEAN means it's something unexpected, but normally it's used in >> static check, no sure if it suits for runtime check. > > My thinking goes if something is an on-disk error (and fuzzed images > fall in that category) then we should return EUCLEAN. If the owner can > be mismatched only as a result of erroneous data on-disk which is then > read and subsequently this code triggers then it's something induced due > to an on-disk error. Makes sense. Does it also mean later BUG_ON() convert would also use EUCLEAN as most BUG_ON() is either some real bug or corrupted/fuzzed images? Thanks, Qu > >> >> Although EINVAL looks more suitable for runtime error, it is not a >> perfect errno either, as it's not something invalid from user, but the >> fs has something unexpected. >> >> I'm all ears on this errno issue. >> >> Thanks, >> Qu >> >>> + } } else { - if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF) + if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF) { parent = path->nodes[level + 1]->start; - else - BUG_ON(root->root_key.objectid != - btrfs_header_owner(path->nodes[level + 1])); + } else if (root->root_key.objectid != +
Re: [PATCH 1/1] btrfs: Handle owner mismatch gracefully when walking up tree
On 1.08.2018 14:13, Qu Wenruo wrote: > > > On 2018年08月01日 18:08, Nikolay Borisov wrote: >> >> >> On 1.08.2018 11:08, Qu Wenruo wrote: >>> [BUG] >>> When mounting certain crafted image, btrfs will trigger kernel BUG_ON() >>> when try to recover balance: >>> -- >>> [ cut here ] >>> kernel BUG at fs/btrfs/extent-tree.c:8956! >>> invalid opcode: [#1] PREEMPT SMP NOPTI >>> CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10 >>> RIP: 0010:walk_up_proc+0x336/0x480 [btrfs] >>> RSP: 0018:b53540c9b890 EFLAGS: 00010202 >>> Call Trace: >>> walk_up_tree+0x172/0x1f0 [btrfs] >>> btrfs_drop_snapshot+0x3a4/0x830 [btrfs] >>> merge_reloc_roots+0xe1/0x1d0 [btrfs] >>> btrfs_recover_relocation+0x3ea/0x420 [btrfs] >>> open_ctree+0x1af3/0x1dd0 [btrfs] >>> btrfs_mount_root+0x66b/0x740 [btrfs] >>> mount_fs+0x3b/0x16a >>> vfs_kern_mount.part.9+0x54/0x140 >>> btrfs_mount+0x16d/0x890 [btrfs] >>> mount_fs+0x3b/0x16a >>> vfs_kern_mount.part.9+0x54/0x140 >>> do_mount+0x1fd/0xda0 >>> ksys_mount+0xba/0xd0 >>> __x64_sys_mount+0x21/0x30 >>> do_syscall_64+0x60/0x210 >>> entry_SYSCALL_64_after_hwframe+0x49/0xbe >>> ---[ end trace d4344e4deee03435 ]--- >>> -- >>> >>> [CAUSE] >>> Another extent tree corruption. >>> >>> In this particular case, tree reloc root's owner is >>> DATA_RELOC_TREE (should be TREE_RELOC_TREE), thus its backref is >>> corrupted and we failed the owner check in walk_up_tree(). >>> >>> [FIX] >>> It's pretty hard to take care of every extent tree corruption, but at >>> least we can remove such BUG_ON() and exit more gracefully. >>> >>> And since in this particular image, DATA_RELOC_TREE and TREE_RELOC_TREE >>> shares the same root (which is obviously invalid), we needs to make >>> __del_reloc_root() more robust to detect such invalid share to avoid >>> possible NULL dereference as root->node can be NULL in this case. >>> >>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411 >>> Reported-by: Xu Wen >>> Signed-off-by: Qu Wenruo >>> --- >>> As always, the patch is also pushed to my github repo, along with other >>> fuzzed images related fixes: >>> https://github.com/adam900710/linux/tree/tree_checker_enhance >>> (BTW, is it correct to indicate a branch like above?) >>> --- >>> fs/btrfs/extent-tree.c | 27 +++ >>> fs/btrfs/relocation.c | 2 +- >>> 2 files changed, 20 insertions(+), 9 deletions(-) >>> >>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >>> index da615ebc072e..5f4ca61348b5 100644 >>> --- a/fs/btrfs/extent-tree.c >>> +++ b/fs/btrfs/extent-tree.c >>> @@ -8949,17 +8949,26 @@ static noinline int walk_up_proc(struct >>> btrfs_trans_handle *trans, >>> } >>> >>> if (eb == root->node) { >>> - if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) >>> + if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) { >>> parent = eb->start; >>> - else >>> - BUG_ON(root->root_key.objectid != >>> - btrfs_header_owner(eb)); >>> + } else if (root->root_key.objectid != btrfs_header_owner(eb)) { >>> + btrfs_err_rl(fs_info, >>> + "unexpected tree owner, have %llu expect %llu", >>> +btrfs_header_owner(eb), >>> +root->root_key.objectid); >>> + return -EINVAL; >> >> EINVAL or ECLEANUP? > > Yep, also my concern here. > > I have no bias here, and both makes its sense here. > > EUCLEAN means it's something unexpected, but normally it's used in > static check, no sure if it suits for runtime check. My thinking goes if something is an on-disk error (and fuzzed images fall in that category) then we should return EUCLEAN. If the owner can be mismatched only as a result of erroneous data on-disk which is then read and subsequently this code triggers then it's something induced due to an on-disk error. > > Although EINVAL looks more suitable for runtime error, it is not a > perfect errno either, as it's not something invalid from user, but the > fs has something unexpected. > > I'm all ears on this errno issue. > > Thanks, > Qu > >> >>> + } >>> } else { >>> - if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF) >>> + if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF) { >>> parent = path->nodes[level + 1]->start; >>> - else >>> - BUG_ON(root->root_key.objectid != >>> - btrfs_header_owner(path->nodes[level + 1])); >>> + } else if (root->root_key.objectid != >>> + btrfs_header_owner(path->nodes[level + 1])) { >>> + btrfs_err_rl(fs_info, >>> + "unexpected tree owner, have %llu expect %llu", >>> +btrfs_header_owner(eb), >>> +root->root_key.objectid); >>
Re: Unmountable root partition
Yes, command output is as is, because I just copied and pasted into the mail. When I omit the `-t btrfs` part, result is the same. I'm now trying to rescue what I can, so getting a image dump with `ddrescue`. It's read about 25% without any errors but it will be expected to finish in 6 hours. Then I'll try btrfscue to see what happens. I know it's totally my fault because I must be ready for a total disk/pc/building burn out. Lessons learned. 2018-08-01 8:32 GMT+03:00 Chris Murphy : > On Tue, Jul 31, 2018 at 12:03 PM, Cerem Cem ASLAN > wrote: > > > 3. mount -t btrfs /dev/mapper/foo--vg-root /mnt/foo > > Gives the following error: > > > > mount: wrong fs type, bad option, bad superblock on ... > > > > 4. dmesg | tail > > Outputs the following: > > > > > > [17755.840916] sd 3:0:0:0: [sda] tag#0 FAILED Result: > > hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK > > [17755.840919] sd 3:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 00 07 c0 02 > 00 00 > > 02 00 > > [17755.840921] blk_update_request: I/O error, dev sda, sector 507906 > > [17755.840941] EXT4-fs (dm-4): unable to read superblock > > > Are you sure this is the output for the command? Because you're > explicitly asking for type btrfs, which fails, and then the kernel > reports EXT4 superblock unreadable. What do you get if you omit -t > btrfs and just let it autodetect? > > But yeah, this is an IO error from the device and there's nothing > Btrfs can do about that unless there is DUP or raid1+ metadata > available. > > Is it possible this LV was accidentally reformatted ext4? > > > -- > Chris Murphy >
Re: BTRFS and databases
On Wed, Aug 01, 2018 at 05:45:15AM +0200, MegaBrutal wrote: > But there is still one question that I can't get over: if you store a > database (e.g. MySQL), would you prefer having a BTRFS volume mounted > with nodatacow, or would you just simply use ext4? > > I know that with nodatacow, I take away most of the benefits of BTRFS > (those are actually hurting database performance – the exact CoW > nature that is elsewhere a blessing, with databases it's a drawback). > But are there any advantages of still sticking to BTRFS for a database > albeit CoW is disabled, or should I just return to the old and > reliable ext4 for those applications? Is this database performance-critical? If yes, you'd want ext4 -- nocow is a crappy ext4 lookalike, with no benefits of btrfs. Or, if you snapshot it, you get bad fragmentation yet no checksums/etc. If no, regular cow (especially with autodefrag) will be enough. Sure, this particular load won't be as performant (mysql really loves fsync, which is an anathema to btrfs), but you get all the data safety improvements, frequent cheap backups, and so on. Thus: if the server's primary purpose is that database, you don't want btrfs. If the database is merely incidental, not microoptimizing it will save a lot of your time. In neither case nocow is a good idea. Especially if raid (!= 0) is involved. Meow! -- // If you believe in so-called "intellectual property", please immediately // cease using counterfeit alphabets. Instead, contact the nearest temple // of Amon, whose priests will provide you with scribal services for all // your writing needs, for Reasonable And Non-Discriminatory prices. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] btrfs: Handle owner mismatch gracefully when walking up tree
On 2018年08月01日 18:08, Nikolay Borisov wrote: > > > On 1.08.2018 11:08, Qu Wenruo wrote: >> [BUG] >> When mounting certain crafted image, btrfs will trigger kernel BUG_ON() >> when try to recover balance: >> -- >> [ cut here ] >> kernel BUG at fs/btrfs/extent-tree.c:8956! >> invalid opcode: [#1] PREEMPT SMP NOPTI >> CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10 >> RIP: 0010:walk_up_proc+0x336/0x480 [btrfs] >> RSP: 0018:b53540c9b890 EFLAGS: 00010202 >> Call Trace: >> walk_up_tree+0x172/0x1f0 [btrfs] >> btrfs_drop_snapshot+0x3a4/0x830 [btrfs] >> merge_reloc_roots+0xe1/0x1d0 [btrfs] >> btrfs_recover_relocation+0x3ea/0x420 [btrfs] >> open_ctree+0x1af3/0x1dd0 [btrfs] >> btrfs_mount_root+0x66b/0x740 [btrfs] >> mount_fs+0x3b/0x16a >> vfs_kern_mount.part.9+0x54/0x140 >> btrfs_mount+0x16d/0x890 [btrfs] >> mount_fs+0x3b/0x16a >> vfs_kern_mount.part.9+0x54/0x140 >> do_mount+0x1fd/0xda0 >> ksys_mount+0xba/0xd0 >> __x64_sys_mount+0x21/0x30 >> do_syscall_64+0x60/0x210 >> entry_SYSCALL_64_after_hwframe+0x49/0xbe >> ---[ end trace d4344e4deee03435 ]--- >> -- >> >> [CAUSE] >> Another extent tree corruption. >> >> In this particular case, tree reloc root's owner is >> DATA_RELOC_TREE (should be TREE_RELOC_TREE), thus its backref is >> corrupted and we failed the owner check in walk_up_tree(). >> >> [FIX] >> It's pretty hard to take care of every extent tree corruption, but at >> least we can remove such BUG_ON() and exit more gracefully. >> >> And since in this particular image, DATA_RELOC_TREE and TREE_RELOC_TREE >> shares the same root (which is obviously invalid), we needs to make >> __del_reloc_root() more robust to detect such invalid share to avoid >> possible NULL dereference as root->node can be NULL in this case. >> >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411 >> Reported-by: Xu Wen >> Signed-off-by: Qu Wenruo >> --- >> As always, the patch is also pushed to my github repo, along with other >> fuzzed images related fixes: >> https://github.com/adam900710/linux/tree/tree_checker_enhance >> (BTW, is it correct to indicate a branch like above?) >> --- >> fs/btrfs/extent-tree.c | 27 +++ >> fs/btrfs/relocation.c | 2 +- >> 2 files changed, 20 insertions(+), 9 deletions(-) >> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index da615ebc072e..5f4ca61348b5 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -8949,17 +8949,26 @@ static noinline int walk_up_proc(struct >> btrfs_trans_handle *trans, >> } >> >> if (eb == root->node) { >> -if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) >> +if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) { >> parent = eb->start; >> -else >> -BUG_ON(root->root_key.objectid != >> - btrfs_header_owner(eb)); >> +} else if (root->root_key.objectid != btrfs_header_owner(eb)) { >> +btrfs_err_rl(fs_info, >> +"unexpected tree owner, have %llu expect %llu", >> + btrfs_header_owner(eb), >> + root->root_key.objectid); >> +return -EINVAL; > > EINVAL or ECLEANUP? Yep, also my concern here. I have no bias here, and both makes its sense here. EUCLEAN means it's something unexpected, but normally it's used in static check, no sure if it suits for runtime check. Although EINVAL looks more suitable for runtime error, it is not a perfect errno either, as it's not something invalid from user, but the fs has something unexpected. I'm all ears on this errno issue. Thanks, Qu > >> +} >> } else { >> -if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF) >> +if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF) { >> parent = path->nodes[level + 1]->start; >> -else >> -BUG_ON(root->root_key.objectid != >> - btrfs_header_owner(path->nodes[level + 1])); >> +} else if (root->root_key.objectid != >> + btrfs_header_owner(path->nodes[level + 1])) { >> +btrfs_err_rl(fs_info, >> +"unexpected tree owner, have %llu expect %llu", >> + btrfs_header_owner(eb), >> + root->root_key.objectid); >> +return -EINVAL; > ditto >> +} >> } >> >> btrfs_free_tree_block(trans, root, eb, parent, wc->refs[level] == 1); >> @@ -9020,6 +9029,8 @@ static noinline int walk_up_tree(struct >> btrfs_trans_handle *trans, >> ret = walk_up_proc(trans, root, path, wc); >> if (ret > 0) >> return 0; >> +if (ret < 0) >> +
Re: [PATCH 4/4] locks: map correct ino/dev pairs when exporting to userspace
On Tue, 2018-07-31 at 14:10 -0700, Mark Fasheh wrote: > /proc/locks does not always print the correct inode number/device pair. > Update lock_get_status() to use vfs_map_unique_ino_dev() to get the real, > unique values for userspace. > > Signed-off-by: Mark Fasheh > --- > fs/locks.c | 12 +--- > 1 file changed, 9 insertions(+), 3 deletions(-) > > diff --git a/fs/locks.c b/fs/locks.c > index db7b6917d9c5..3a012df87fd8 100644 > --- a/fs/locks.c > +++ b/fs/locks.c > @@ -2621,6 +2621,7 @@ static void lock_get_status(struct seq_file *f, struct > file_lock *fl, > loff_t id, char *pfx) > { > struct inode *inode = NULL; > + struct dentry *dentry; > unsigned int fl_pid; > struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info; > > @@ -2633,8 +2634,10 @@ static void lock_get_status(struct seq_file *f, struct > file_lock *fl, > if (fl_pid == 0) > return; > > - if (fl->fl_file != NULL) > + if (fl->fl_file != NULL) { > inode = locks_inode(fl->fl_file); > + dentry = file_dentry(fl->fl_file); > + } > > seq_printf(f, "%lld:%s ", id, pfx); > if (IS_POSIX(fl)) { > @@ -2681,10 +2684,13 @@ static void lock_get_status(struct seq_file *f, > struct file_lock *fl, > : (fl->fl_type == F_WRLCK) ? "WRITE" : "READ "); > } > if (inode) { > + __u64 ino; > + dev_t dev; > + > + vfs_map_unique_ino_dev(dentry, &ino, &dev); This code is under a spinlock (blocked_locks_lock or ctx->flc_lock). I don't think it'll be ok to call ->getattr while holding a spinlock. > /* userspace relies on this representation of dev_t */ > seq_printf(f, "%d %02x:%02x:%ld ", fl_pid, > - MAJOR(inode->i_sb->s_dev), > - MINOR(inode->i_sb->s_dev), inode->i_ino); > + MAJOR(dev), MINOR(dev), inode->i_ino); > } else { > seq_printf(f, "%d :0 ", fl_pid); > } -- Jeff Layton -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix data lose with snapshot when nospace
Filipe Manana 於 2018-07-31 19:33 寫到: On Tue, Jul 31, 2018 at 11:17 AM, robbieko wrote: Filipe Manana 於 2018-07-30 20:34 寫到: On Mon, Jul 30, 2018 at 12:28 PM, Filipe Manana wrote: On Mon, Jul 30, 2018 at 12:08 PM, Filipe Manana wrote: On Mon, Jul 30, 2018 at 11:21 AM, robbieko wrote: From: Robbie Ko Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting") modified the nocow writeback mechanism, if you create a snapshot, it will always switch to cow writeback. This will cause data loss when there is no space, because when the space is full, the write will not reserve any space, only check if it can be nocow write. This is a bit vague. You need to mention where space reservation does not happen (at the time of the write syscall) and why, and that the snapshot happens before flushing IO (running dealloc). Then when running dealloc we fallback to COW and fail. You also need to tell that although the write syscall did not return an error, the writeback will fail but a subsequent fsync on the file will return an error (ENOSPC) because the writeback set the error on the inode's mapping, so it's not completely a silent data loss, as for buffered writes there's no guarantee that if write syscall returns 0 the data will be persisted successfully (that can only be guaranteed if a subsequent fsync call returns 0). So fix this by first flush the nocow data, and then switch to the cow write. I'm also not seeing how what you've done is better then we have now using the root->will_be_snapshotted atomic, which is essentially used the same way as the new atomic you are adding, and forces the writeback code no nocow writes as well. So what you have done can be made much more simple by flushing delalloc before incrementing root->will_be_snapshotted instead of after incrementing it: https://friendpaste.com/2LY9eLAR9q0RoOtRK7VYmX There is no way to solve this problem in this modification. It minimizes it. It only gives better guarantees that nocow buffered writes that happened before calling the snapshot ioctl will not fall back to cow, not for the ones that happen while the call to the ioctl is happening. When writing and create snapshot at the same time, the write will not reserve space, and will not return to ENOSPC, because will_be_snapshotted is still 0. So when writeback flush data, there will still be problems with ENOSPC. Which is precisely what I proposed does without adding a new atomic and more changes. It flushes delalloc before incrementing root->will_be_snapshotted, so that previous buffered nocow writes will not fallback to cow mode (and require data space allocation). It only leaves a very tiny and very unlikely to hit (but not impossible) time window where nocow writes will fallback to cow mode - after calling start_delalloc_inodes() and before incrementing root->will_be_snapshotted a new buffered write can comes in and gets immediately flushed because someone called fsync() on the file or the VM decided to trigger writeback (due to memory pressure or some other reason). It is very easy to reproduce. Not a tiny time. Because the time of start_delalloc_inodes will be very long. When the dirty page is reduced, the new write can continue to be written, and there is no reserve space. And these dirty pages are not necessarily flushed by start_delalloc_inodes because start_delalloc_inodes is checked from the front to the back, so when the new write is written to the previous position, it will not flush again. So when start_delalloc_inodes is executed, will_be_snapshotted is added, and all remaining dirty pages will be forced to turn to cow, all data is loss. e.g. 16G RAM, dirty ratio 20%, 16*0.2 = 3.2 GB data loss. Writing the inode that has been flushed will have the same problem. The inode that has been flushed will not be flushed for the second time. So you have better suggestions ? The behavior I changed was to increase will_be_snapshotted first, so the following write must have a reserve space, otherwise it must be returned to ENOSPC. And then go to flush data and flush the diry page with nocow, When all the dirty pages are written back, then switch to cow mode. And why did you wrote such a vague changelog? It misses all those important and subtle details of the change. Just checked the code and failure to allocate space during writeback after falling back to COW mode does indeed set AS_ENOSPC on the inode's mapping, which makes fsync return ENOSPC (through file_check_and_advance_wb_err() and filemap_check_wb_err()). Since fsync reports the error, I'm unsure to call it data loss but rather an optimization to avoid ENOSPC for nocow writes when running low on space. If you do not use fsync, you will not find the data loss. That's one of the reasons why fsync exists. I think that as long as the write returns successfully, the data must be written back to the disk at the end, otherwise it will be considered as data loss. No filesystem
Re: [PATCH 1/1] btrfs: Handle owner mismatch gracefully when walking up tree
On 1.08.2018 11:08, Qu Wenruo wrote: > [BUG] > When mounting certain crafted image, btrfs will trigger kernel BUG_ON() > when try to recover balance: > -- > [ cut here ] > kernel BUG at fs/btrfs/extent-tree.c:8956! > invalid opcode: [#1] PREEMPT SMP NOPTI > CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10 > RIP: 0010:walk_up_proc+0x336/0x480 [btrfs] > RSP: 0018:b53540c9b890 EFLAGS: 00010202 > Call Trace: > walk_up_tree+0x172/0x1f0 [btrfs] > btrfs_drop_snapshot+0x3a4/0x830 [btrfs] > merge_reloc_roots+0xe1/0x1d0 [btrfs] > btrfs_recover_relocation+0x3ea/0x420 [btrfs] > open_ctree+0x1af3/0x1dd0 [btrfs] > btrfs_mount_root+0x66b/0x740 [btrfs] > mount_fs+0x3b/0x16a > vfs_kern_mount.part.9+0x54/0x140 > btrfs_mount+0x16d/0x890 [btrfs] > mount_fs+0x3b/0x16a > vfs_kern_mount.part.9+0x54/0x140 > do_mount+0x1fd/0xda0 > ksys_mount+0xba/0xd0 > __x64_sys_mount+0x21/0x30 > do_syscall_64+0x60/0x210 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > ---[ end trace d4344e4deee03435 ]--- > -- > > [CAUSE] > Another extent tree corruption. > > In this particular case, tree reloc root's owner is > DATA_RELOC_TREE (should be TREE_RELOC_TREE), thus its backref is > corrupted and we failed the owner check in walk_up_tree(). > > [FIX] > It's pretty hard to take care of every extent tree corruption, but at > least we can remove such BUG_ON() and exit more gracefully. > > And since in this particular image, DATA_RELOC_TREE and TREE_RELOC_TREE > shares the same root (which is obviously invalid), we needs to make > __del_reloc_root() more robust to detect such invalid share to avoid > possible NULL dereference as root->node can be NULL in this case. > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411 > Reported-by: Xu Wen > Signed-off-by: Qu Wenruo > --- > As always, the patch is also pushed to my github repo, along with other > fuzzed images related fixes: > https://github.com/adam900710/linux/tree/tree_checker_enhance > (BTW, is it correct to indicate a branch like above?) > --- > fs/btrfs/extent-tree.c | 27 +++ > fs/btrfs/relocation.c | 2 +- > 2 files changed, 20 insertions(+), 9 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index da615ebc072e..5f4ca61348b5 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -8949,17 +8949,26 @@ static noinline int walk_up_proc(struct > btrfs_trans_handle *trans, > } > > if (eb == root->node) { > - if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) > + if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) { > parent = eb->start; > - else > - BUG_ON(root->root_key.objectid != > -btrfs_header_owner(eb)); > + } else if (root->root_key.objectid != btrfs_header_owner(eb)) { > + btrfs_err_rl(fs_info, > + "unexpected tree owner, have %llu expect %llu", > + btrfs_header_owner(eb), > + root->root_key.objectid); > + return -EINVAL; EINVAL or ECLEANUP? > + } > } else { > - if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF) > + if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF) { > parent = path->nodes[level + 1]->start; > - else > - BUG_ON(root->root_key.objectid != > -btrfs_header_owner(path->nodes[level + 1])); > + } else if (root->root_key.objectid != > +btrfs_header_owner(path->nodes[level + 1])) { > + btrfs_err_rl(fs_info, > + "unexpected tree owner, have %llu expect %llu", > + btrfs_header_owner(eb), > + root->root_key.objectid); > + return -EINVAL; ditto > + } > } > > btrfs_free_tree_block(trans, root, eb, parent, wc->refs[level] == 1); > @@ -9020,6 +9029,8 @@ static noinline int walk_up_tree(struct > btrfs_trans_handle *trans, > ret = walk_up_proc(trans, root, path, wc); > if (ret > 0) > return 0; > + if (ret < 0) > + return ret; > > if (path->locks[level]) { > btrfs_tree_unlock_rw(path->nodes[level], > diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c > index a2fc0bd83a40..c64051d33d05 100644 > --- a/fs/btrfs/relocation.c > +++ b/fs/btrfs/relocation.c > @@ -1321,7 +1321,7 @@ static void __del_reloc_root(struct btrfs_root *root) > struct mapping_node *node = NULL; > struct reloc_control *rc = fs_info->reloc_ctl; > > - if (rc) { > + if (rc && root->node)
Re: BTRFS and databases
On 1 August 2018 at 04:45, MegaBrutal wrote: > But there is still one question that I can't get over: if you store a > database (e.g. MySQL), would you prefer having a BTRFS volume mounted > with nodatacow, or would you just simply use ext4? > > I know that with nodatacow, I take away most of the benefits of BTRFS > (those are actually hurting database performance – the exact CoW > nature that is elsewhere a blessing, with databases it's a drawback). > But are there any advantages of still sticking to BTRFS for a database > albeit CoW is disabled, or should I just return to the old and > reliable ext4 for those applications? Also note that no data CoW implies no data checksums too. https://btrfs.wiki.kernel.org/index.php/FAQ#Can_I_have_nodatacow_.28or_chattr_.2BC.29_but_still_have_checksumming.3F Mike -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS and databases
On Wed, Aug 01, 2018 at 05:45:15AM +0200, MegaBrutal wrote: > I know it's a decade-old question, but I'd like to hear your thoughts > of today. By now, I became a heavy BTRFS user. Almost everywhere I use > BTRFS, except in situations when it is obvious there is no benefit > (e.g. /var/log, /boot). At home, all my desktop, laptop and server > computers are mainly running on BTRFS with only a few file systems on > ext4. I even installed BTRFS in corporate productive systems (in those > cases, the systems were mainly on ext4; but there were some specific > file systems those exploited BTRFS features). > > But there is still one question that I can't get over: if you store a > database (e.g. MySQL), would you prefer having a BTRFS volume mounted > with nodatacow, or would you just simply use ext4? Personally, I'd start with btrfs with autodefrag. It has some degree of I/O overhead, but if the database isn't performance-critical and already near the limits of the hardware, it's unlikely to make much difference. Autodefrag should keep the fragmentation down to a minimum. Hugo. > I know that with nodatacow, I take away most of the benefits of BTRFS > (those are actually hurting database performance – the exact CoW > nature that is elsewhere a blessing, with databases it's a drawback). > But are there any advantages of still sticking to BTRFS for a database > albeit CoW is disabled, or should I just return to the old and > reliable ext4 for those applications? > > > Kind regards, > MegaBrutal -- Hugo Mills | In theory, theory and practice are the same. In hugo@... carfax.org.uk | practice, they're different. http://carfax.org.uk/ | PGP: E2AB1DE4 | signature.asc Description: Digital signature
Re: BTRFS and databases
MegaBrutal posted on Wed, 01 Aug 2018 05:45:15 +0200 as excerpted: > But there is still one question that I can't get over: if you store a > database (e.g. MySQL), would you prefer having a BTRFS volume mounted > with nodatacow, or would you just simply use ext4? > > I know that with nodatacow, I take away most of the benefits of BTRFS > (those are actually hurting database performance – the exact CoW nature > that is elsewhere a blessing, with databases it's a drawback). But are > there any advantages of still sticking to BTRFS for a database albeit > CoW is disabled, or should I just return to the old and reliable ext4 > for those applications? Good question, on which I might expect some honest disagreement on the answer. Personally, I tend to hate nocow with a passion, and would thus recommend putting databases and similar write-pattern (VM images...) files on their own dedicated non-btrfs (ext4, etc) if at all reasonable. But that comes from a general split partition-favoring viewpoint, where doing another partition/lvm-volume and putting a different filesystem on it is no big deal, as it's just one more partition/volume to manage of (likely) several. Some distros/companies/installations have policies strongly favoring btrfs for its "storage pool" features, trying to keep things simple and flexible by using just the one solution and one big btrfs and throwing everything onto it, often using btrfs subvolumes where others would use separate partitions/volumes with independent filesystems. For these folks, the flexibility of being able to throw it all on one filesystem with subvolumes overrides the down sides of having to deal with nocow and its conditions, rules and additional risk. And a big part of that flexibility, along with being a feature in its own right, is btrfs built-in multi-device, without having to resort to an additional multi-device layer such as lvm or mdraid. So if you're using btrfs for multi-device or other features that nocow doesn't affect, it's plausible that you'd prefer nocow on btrfs to /having/ to do partitioning/lvm/mdraid and setup that separate non-btrfs just for your database (or vm image) files. But from your post you're perfectly fine with partitioning and the like already, and won't consider it a heavy imposition to deal with a separate non-btrfs, ext4 or whatever, and in that case, at least here, I'd strongly recommend you do just that, avoiding the nocow that I honestly see as a compromise best left to those that really need it because they aren't prepared to deal with the hassle of setting up the separate filesystem along with all that entails. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/1] btrfs: Handle owner mismatch gracefully when walking up tree
[BUG] When mounting certain crafted image, btrfs will trigger kernel BUG_ON() when try to recover balance: -- [ cut here ] kernel BUG at fs/btrfs/extent-tree.c:8956! invalid opcode: [#1] PREEMPT SMP NOPTI CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10 RIP: 0010:walk_up_proc+0x336/0x480 [btrfs] RSP: 0018:b53540c9b890 EFLAGS: 00010202 Call Trace: walk_up_tree+0x172/0x1f0 [btrfs] btrfs_drop_snapshot+0x3a4/0x830 [btrfs] merge_reloc_roots+0xe1/0x1d0 [btrfs] btrfs_recover_relocation+0x3ea/0x420 [btrfs] open_ctree+0x1af3/0x1dd0 [btrfs] btrfs_mount_root+0x66b/0x740 [btrfs] mount_fs+0x3b/0x16a vfs_kern_mount.part.9+0x54/0x140 btrfs_mount+0x16d/0x890 [btrfs] mount_fs+0x3b/0x16a vfs_kern_mount.part.9+0x54/0x140 do_mount+0x1fd/0xda0 ksys_mount+0xba/0xd0 __x64_sys_mount+0x21/0x30 do_syscall_64+0x60/0x210 entry_SYSCALL_64_after_hwframe+0x49/0xbe ---[ end trace d4344e4deee03435 ]--- -- [CAUSE] Another extent tree corruption. In this particular case, tree reloc root's owner is DATA_RELOC_TREE (should be TREE_RELOC_TREE), thus its backref is corrupted and we failed the owner check in walk_up_tree(). [FIX] It's pretty hard to take care of every extent tree corruption, but at least we can remove such BUG_ON() and exit more gracefully. And since in this particular image, DATA_RELOC_TREE and TREE_RELOC_TREE shares the same root (which is obviously invalid), we needs to make __del_reloc_root() more robust to detect such invalid share to avoid possible NULL dereference as root->node can be NULL in this case. Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411 Reported-by: Xu Wen Signed-off-by: Qu Wenruo --- As always, the patch is also pushed to my github repo, along with other fuzzed images related fixes: https://github.com/adam900710/linux/tree/tree_checker_enhance (BTW, is it correct to indicate a branch like above?) --- fs/btrfs/extent-tree.c | 27 +++ fs/btrfs/relocation.c | 2 +- 2 files changed, 20 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index da615ebc072e..5f4ca61348b5 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -8949,17 +8949,26 @@ static noinline int walk_up_proc(struct btrfs_trans_handle *trans, } if (eb == root->node) { - if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) + if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) { parent = eb->start; - else - BUG_ON(root->root_key.objectid != - btrfs_header_owner(eb)); + } else if (root->root_key.objectid != btrfs_header_owner(eb)) { + btrfs_err_rl(fs_info, + "unexpected tree owner, have %llu expect %llu", +btrfs_header_owner(eb), +root->root_key.objectid); + return -EINVAL; + } } else { - if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF) + if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF) { parent = path->nodes[level + 1]->start; - else - BUG_ON(root->root_key.objectid != - btrfs_header_owner(path->nodes[level + 1])); + } else if (root->root_key.objectid != + btrfs_header_owner(path->nodes[level + 1])) { + btrfs_err_rl(fs_info, + "unexpected tree owner, have %llu expect %llu", +btrfs_header_owner(eb), +root->root_key.objectid); + return -EINVAL; + } } btrfs_free_tree_block(trans, root, eb, parent, wc->refs[level] == 1); @@ -9020,6 +9029,8 @@ static noinline int walk_up_tree(struct btrfs_trans_handle *trans, ret = walk_up_proc(trans, root, path, wc); if (ret > 0) return 0; + if (ret < 0) + return ret; if (path->locks[level]) { btrfs_tree_unlock_rw(path->nodes[level], diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index a2fc0bd83a40..c64051d33d05 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -1321,7 +1321,7 @@ static void __del_reloc_root(struct btrfs_root *root) struct mapping_node *node = NULL; struct reloc_control *rc = fs_info->reloc_ctl; - if (rc) { + if (rc && root->node) { spin_lock(&rc->reloc_root_tree.lock); rb_node = tree_search(&rc->reloc_root_tree.rb_root, root->node->start); -- 2.18.0 -- To unsubscribe from this
Re: csum failed on raid1 even after clean scrub?
Sterling Windmill posted on Mon, 30 Jul 2018 21:06:54 -0400 as excerpted: > Both drives are identical, Seagate 8TB external drives Are those the "shingled" SMR drives, normally sold as archive drives and first commonly available in the 8TB size, and often bought for their generally better price-per-TB without fully realizing the implications. There have been bugs regarding those drives in the past, and while I believe those bugs were fixed and AFAIK current status is no known SMR- specific bugs, they really are /not/ particularly suited to btrfs usage even for archiving, and definitely not to general usage (that is, pretty much anything but the straight-up archiving use-case they are sold for) use-cases. Of course USB connections are notorious for being unreliable in terms of btrfs usage as well, and I'd really hate to think what a combination of SMR on USB might wreak. If they're not SMR then carry-on! =:^) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html