Re: [LKP] [lkp-robot] [mm] 9092c71bb7: blogbench.write_score -12.3% regression

2018-08-01 Thread Huang, Ying
"Huang, Ying"  writes:

> Hi, Chris,
>
> Chris Mason  writes:
>
>> On 19 Jun 2018, at 23:51, Huang, Ying wrote:
> "Huang, Ying"  writes:
>
>> Hi, Josef,
>>
>> Do you have time to take a look at the regression?
>>
>> kernel test robot  writes:
>>
>>> Greeting,
>>>
>>> FYI, we noticed a -12.3% regression of blogbench.write_score and
>>> a +9.6% improvement
>>> of blogbench.read_score due to commit:
>>>
>>>
>>> commit: 9092c71bb724dba2ecba849eae69e5c9d39bd3d2 ("mm: use
>>> sc->priority for slab shrink targets")
>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git
>>> master
>>>
>>> in testcase: blogbench
>>> on test machine: 16 threads Intel(R) Xeon(R) CPU D-1541 @
>>> 2.10GHz with 8G memory
>>> with following parameters:
>>>
>>> disk: 1SSD
>>> fs: btrfs
>>> cpufreq_governor: performance
>>>
>>> test-description: Blogbench is a portable filesystem benchmark
>>> that tries to reproduce the load of a real-world busy file
>>> server.
>>> test-url:
>>
>> I'm surprised, this patch is a big win in production here at FB.  I'll
>> have to reproduce these results to better understand what is going on.
>> My first guess is that since we have fewer inodes in slab, we're
>> reading more inodes from disk in order to do the writes.
>>
>> But that should also make our read scores lower.
>
> Any update on this?

Ping.

Best Regards,
Huang, Ying
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix data lose with snapshot when nospace

2018-08-01 Thread robbieko

Filipe Manana 於 2018-08-02 01:55 寫到:
On Wed, Aug 1, 2018 at 1:54 PM, Filipe Manana  
wrote:
On Wed, Aug 1, 2018 at 11:20 AM, robbieko  
wrote:

Filipe Manana 於 2018-07-31 19:33 寫到:

On Tue, Jul 31, 2018 at 11:17 AM, robbieko  
wrote:


Filipe Manana 於 2018-07-30 20:34 寫到:

On Mon, Jul 30, 2018 at 12:28 PM, Filipe Manana 


wrote:



On Mon, Jul 30, 2018 at 12:08 PM, Filipe Manana 


wrote:



On Mon, Jul 30, 2018 at 11:21 AM, robbieko 


wrote:



From: Robbie Ko 

Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
modified the nocow writeback mechanism, if you create a 
snapshot,

it will always switch to cow writeback.

This will cause data loss when there is no space, because
when the space is full, the write will not reserve any space, 
only

check if it can be nocow write.




This is a bit vague.
You need to mention where space reservation does not happen (at 
the

time of the write syscall) and why,
and that the snapshot happens before flushing IO (running 
dealloc).

Then when running dealloc we fallback
to COW and fail.

You also need to tell that although the write syscall did not 
return

an error, the writeback will
fail but a subsequent fsync on the file will return an error 
(ENOSPC)

because the writeback set the error
on the inode's mapping, so it's not completely a silent data 
loss, as

for buffered writes there's no guarantee
that if write syscall returns 0 the data will be persisted
successfully (that can only be guaranteed if a subsequent
fsync call returns 0).



So fix this by first flush the nocow data, and then switch to 
the

cow write.




I'm also not seeing how what you've done is better then we have 
now

using the root->will_be_snapshotted atomic,
which is essentially used the same way as the new atomic you are
adding, and forces the writeback code no nocow
writes as well.




So what you have done can be made much more simple by flushing
delalloc before incrementing root->will_be_snapshotted instead of
after incrementing it:

https://friendpaste.com/2LY9eLAR9q0RoOtRK7VYmX



There is no way to solve this problem in this modification.



It minimizes it. It only gives better guarantees that nocow buffered
writes that happened before calling the snapshot ioctl will not fall
back to cow,
not for the ones that happen while the call to the ioctl is 
happening.




When writing and create snapshot at the same time, the write will 
not

reserve space,
and will not return to ENOSPC, because will_be_snapshotted is still 
0.
So when writeback flush data, there will still be problems with 
ENOSPC.



Which is precisely what I proposed does without adding a new atomic
and more changes.
It flushes delalloc before incrementing root->will_be_snapshotted, 
so
that previous buffered nocow writes will not fallback to cow mode 
(and

require data space allocation).

It only leaves a very tiny and very unlikely to hit (but not
impossible) time window where nocow writes will fallback
to cow mode - after calling start_delalloc_inodes() and before
incrementing root->will_be_snapshotted a new buffered write can 
comes

in and gets immediately flushed
because someone called fsync() on the file or the VM decided to
trigger writeback (due to memory pressure or some other reason).



It is very easy to reproduce. Not a tiny time.
Because the time of start_delalloc_inodes will be very long.
When the dirty page is reduced, the new write can continue
to be written, and there is no reserve space.


I see now, thanks.


And these dirty pages are not necessarily flushed by
start_delalloc_inodes because start_delalloc_inodes is
checked from the front to the back, so when the new
write is written to the previous position, it will not flush again.

So when start_delalloc_inodes is executed, will_be_snapshotted is 
added,
and all remaining dirty pages will be forced to turn to cow, all data 
is

loss.
e.g. 16G RAM, dirty ratio 20%,  16*0.2 = 3.2 GB data loss.

Writing the inode that has been flushed will have the same problem.
The inode that has been flushed will not be flushed for the second 
time.


So you have better suggestions ?


Not efficient ones (introducing more locks).
This one if fine, but please write a changelog that mentions all these
details from your replies.
The original one does not explain in detail the problem nor the 
solution.

Some comments before incrementing both atomics and flushing delalloc
at ioctl.c would also be good to have.

And a test case for fstests too.


So the fstests test case I converted my testing script into a proper
patch for fstests:

https://friendpaste.com/1oKSUHoZjp9OZtxDcK1Mey



This test case is very good.


Once you get a better title for the patch, I'll update the test's
patch changelog  and submit it to fstests.
Fails with current btrfs and passes with your patch (and with earlier
proposal too).
Let me know if it's ok for you. Thanks.



I will rewrite the changelog and add comments,
then resend the patch v2.
And replace the titled to "Btrfs: fix 

Re: [PATCH 2/4] nfs: check for NULL vfsmount in nfs_getattr

2018-08-01 Thread Al Viro
On Tue, Jul 31, 2018 at 10:51:57PM +, Mark Fasheh wrote:
> Hi Al,
> 
> On Tue, Jul 31, 2018 at 11:16:05PM +0100, Al Viro wrote:
> > On Tue, Jul 31, 2018 at 02:10:43PM -0700, Mark Fasheh wrote:
> > > ->getattr from inside the kernel won't always have a vfsmount, check for
> > > this.
> > 
> > Really?  Where would that happen?
> 
> It happens in my first patch. FWIW, I'm not tied to that particular bit of
> code, or even this (latest) approach. I would actually very much appreciate
> your input into how we might fix the problem we have where we are sometimes
> not exporting a correct ino/dev pair to user space.

Which callers would those be?  I don't see anything in your series that
wouldn't have vfsmount at hand...

> I have a good break-down of the problem and possible solutions here:
> 
> https://www.spinics.net/lists/linux-fsdevel/msg128003.html

btrfs pulling odd crap with device numbers is a problem, but this is far from
the most obvious unpleasantness caused by that - e.g. userland code expecting
that unequal st_dev for directory and subdirectory means that we have
a mountpoint there.  Or assuming it can compare that to st_dev of mountpoints
(or, indeed, the values in /proc/self/mountinfo) to find which fs it's on...

/proc/*/maps is unfortunate, but it does contain the pathnames, doesn't it?
What _real_ problems are there - the ones where we don't have a saner solution?
Note that /proc/locks is (a) FPOS interface and (b) unsuitable for your
approach anyway.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/4] vfs: map unique ino/dev pairs for user space

2018-08-01 Thread J. R. Okajima
Mark Fasheh:
> The following patches fix these inconsistencies by introducing a VFS helper
> function which calls the underlying filesystem ->getattr to get our real
> inode number / device pair.  The returned values can then be used at those
> places in the kernel where we are incorrectly reporting our ino/dev pair. 
> We then update fs/proc/ and fs/locks.c to use this helper when writing to
> /proc/PID/maps and /proc/locks respectively.

I definitly agree that ino/dev pair should be a unique identity on the
system.  But I don't know why you are tryng to solve the problem in
generic VFS layer instead of the problematic FS.  Isn't it an
unnecessary overhead for many FS?
How about creating a new f_op member ->get_ino_dev(), ->show_identity()
or something, and implement the new f_op in the problematic FS only?
I hope it will be a lighter way to get the pair than generic getattr
way.


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] locks: map correct ino/dev pairs when exporting to userspace

2018-08-01 Thread Mark Fasheh
Hi Jeff,

On Wed, Aug 01, 2018 at 07:03:54AM -0400, Jeff Layton wrote:
> On Tue, 2018-07-31 at 14:10 -0700, Mark Fasheh wrote:
> > /proc/locks does not always print the correct inode number/device pair.
> > Update lock_get_status() to use vfs_map_unique_ino_dev() to get the real,
> > unique values for userspace.
> > 
> > Signed-off-by: Mark Fasheh 
> > ---
> >  fs/locks.c | 12 +---
> >  1 file changed, 9 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/locks.c b/fs/locks.c
> > index db7b6917d9c5..3a012df87fd8 100644
> > --- a/fs/locks.c
> > +++ b/fs/locks.c
> > @@ -2621,6 +2621,7 @@ static void lock_get_status(struct seq_file *f, 
> > struct file_lock *fl,
> > loff_t id, char *pfx)
> >  {
> > struct inode *inode = NULL;
> > +   struct dentry *dentry;
> > unsigned int fl_pid;
> > struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
> >  
> > @@ -2633,8 +2634,10 @@ static void lock_get_status(struct seq_file *f, 
> > struct file_lock *fl,
> > if (fl_pid == 0)
> > return;
> >  
> > -   if (fl->fl_file != NULL)
> > +   if (fl->fl_file != NULL) {
> > inode = locks_inode(fl->fl_file);
> > +   dentry = file_dentry(fl->fl_file);
> > +   }
> >  
> > seq_printf(f, "%lld:%s ", id, pfx);
> > if (IS_POSIX(fl)) {
> > @@ -2681,10 +2684,13 @@ static void lock_get_status(struct seq_file *f, 
> > struct file_lock *fl,
> >: (fl->fl_type == F_WRLCK) ? "WRITE" : "READ ");
> > }
> > if (inode) {
> > +   __u64 ino;
> > +   dev_t dev;
> > +
> > +   vfs_map_unique_ino_dev(dentry, , );
> 
> This code is under a spinlock (blocked_locks_lock or ctx->flc_lock). I
> don't think it'll be ok to call ->getattr while holding a spinlock.

Hmm, indeed it does call under spinlock. I'll hve to rethink how to approach
this in fs/locks.c

Thanks,
--Mark
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Expected Received UUID

2018-08-01 Thread Gaurav Sanghavi
Hi,

Here are the kernel details (uname -a):

Device A (Debian 8.10):
Linux DeviceA 3.16.0-4-amd64 #1 SMP Debian 3.16.51-3 (2017-12-13) x86_64
GNU/Linux

Device B (Debian 8.11):
Linux DeviceB 3.16.0-4-amd64 #1 SMP Debian 3.16.51-3 (2017-12-13) x86_64
GNU/Linux

Device C (Debian 8.9):
Linux DeviceC 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u3 (2017-08-15)
x86_64 GNU/Linux

All devices run Debian Jessie but I updated btrfs-tools and btrfs-progs on
devices B and C using the
stretch repository to see the Received UUID details ( I had not come across
python-btrfs earlier )

On Tue, Jul 31, 2018 at 4:37 PM, Hans van Kranenburg <
hans.van.kranenb...@mendix.com> wrote:

> Hi,
>
> Can you please report the kernel versions you are using on every system
> (uname -a)?
>
> I would indeed expect the Received UUID value for C to have the same
> uuid as the original UUID of A, so the b00e8ba1 one.
>
> The send part, generating the send stream is done by the running kernel.
> The receive part is done by the userspace btrfs progs. The btrfs progs
> receive code just sets the Received UUID to whatever it reads from the
> send stream information.
>
> So, your sending kernel version is important here.
>
> When looking up the lines that send the UUID, in fs/btrfs/send.c, I can
> see there was a problem like this in the past:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin
> ux.git/commit/?id=b96b1db039ebc584d03a9933b279e0d3e704c528
>
> I was introduced between linux 4.1 and 4.2 and solved in 2015 with that
> commit, which ended up somewhere in 4.3 I think.
>
> From the btrfs-progs versions you list, I suspect this is a Debian
> Jessie and 2x Debian Stretch, but the Debian 3.16 and 4.9 kernels would
> not should not have this issue.
>
> On 07/31/2018 07:42 PM, Gaurav Sanghavi wrote:
> > Hello everyone,
> >
> > While researching BTRFS for a project that involves backing up snapshots
> > from device A to device B
> > before consolidating backups from device B to device C, I noticed that
> > received UUID on snapshot on
> > device C is not the same as received UUID for the same snapshot on
> > device B. Here are my steps:
> >
> > 1. Device A
> > BTRFS version: v3.17
> >
> > btrfs su sn -r /home/test/snapshot1 /home/test/snaps/snapshot1
> > btrfs su show /home/test/snaps/snapshot1
> > Name: snapshot1
> > uuid: b00e8ba1-5aaa-3641-9c4c-e168eee5c296
> > Parent uuid: cb570dec-e9fd-1f40-99d2-2070c8ee2466
> > Received UUID:  ---
> > Creation time: 2018-07-30 18:32:37
> > Flags: readonly
> >
> > 2. Send to Device B
> > btrfs send /home/test/snaps/snapshot1 | ssh  'btrfs receive
> > /home/backups/'
> >
> > After send completes, on Device B
> > BTRFS version: v4.7.3
> > btrfs su show /home/backups/snapshot1
> > Name: snapshot1
> > UUID: 7c13d189-7fee-584e-ac90-e68cb0012f5c
> > Parent UUID: a2314f7c-4b11-ed40-901f-f1acb5ebf802
> > Received UUID: b00e8ba1-5aaa-3641-9c4c-e168eee5c296
> > Creation time: 2018-07-30 18:42:37 -0700
> > Flags: readonly
> >
> >
> > 3. Send to Device C
> > btrfs send /home/backups/snapshot1 | ssh  'btrfs receive
> > /home/backups2/'
> >
> > After send completes, on Device C
> > BTRFS version: v4.7.3
> > btrfs su show /home/backups2/snapshot1
> > Name: snapshot1
> > UUID: 8a13aab5-8e44-2541-9082-bc583933b964
> > Parent UUID: 54e9b4ff-46dc-534e-b70f-69eb7bb21028
> > Received UUID: 7c13d189-7fee-584e-ac90-e68cb0012f5c
> > Creation time: 2018-07-30 18:58:32 -0700
> > Flags: readonly
> >
> > 1. I have gone through some of the archived emails and have noticed
> > people mentioning that
> > if received UUID is set, btrfs send propogates this 'received UUID'. But
> > in above case,
> > it's different for the same snapshot on all three devices. Is this the
> > expected behavior ?
> >
> > 2. We want to be able to start backing up from Device A to C, in case B
> > goes down or needs
> > to be replaced. If received UUID is expected to differ for the snapshot
> > on device B and C, incremental
> > backups will not work from A to C without setting received UUID. I have
> > seen python-btrfs
> > mentioned in a couple of emails; but have anyone of you used it in a
> > production environment ?
> >
> > This is my first post to this email. Please let me know if I am missing
> > any details.
> >
> > Thanks,
> > Gaurav
> >
>
> --
> Hans van Kranenburg
>


Re: [PATCH] Btrfs: fix data lose with snapshot when nospace

2018-08-01 Thread Filipe Manana
On Wed, Aug 1, 2018 at 1:54 PM, Filipe Manana  wrote:
> On Wed, Aug 1, 2018 at 11:20 AM, robbieko  wrote:
>> Filipe Manana 於 2018-07-31 19:33 寫到:
>>
>>> On Tue, Jul 31, 2018 at 11:17 AM, robbieko  wrote:

 Filipe Manana 於 2018-07-30 20:34 寫到:

> On Mon, Jul 30, 2018 at 12:28 PM, Filipe Manana 
> wrote:
>>
>>
>> On Mon, Jul 30, 2018 at 12:08 PM, Filipe Manana 
>> wrote:
>>>
>>>
>>> On Mon, Jul 30, 2018 at 11:21 AM, robbieko 
>>> wrote:


 From: Robbie Ko 

 Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
 modified the nocow writeback mechanism, if you create a snapshot,
 it will always switch to cow writeback.

 This will cause data loss when there is no space, because
 when the space is full, the write will not reserve any space, only
 check if it can be nocow write.
>>>
>>>
>>>
>>> This is a bit vague.
>>> You need to mention where space reservation does not happen (at the
>>> time of the write syscall) and why,
>>> and that the snapshot happens before flushing IO (running dealloc).
>>> Then when running dealloc we fallback
>>> to COW and fail.
>>>
>>> You also need to tell that although the write syscall did not return
>>> an error, the writeback will
>>> fail but a subsequent fsync on the file will return an error (ENOSPC)
>>> because the writeback set the error
>>> on the inode's mapping, so it's not completely a silent data loss, as
>>> for buffered writes there's no guarantee
>>> that if write syscall returns 0 the data will be persisted
>>> successfully (that can only be guaranteed if a subsequent
>>> fsync call returns 0).
>>>

 So fix this by first flush the nocow data, and then switch to the
 cow write.
>>
>>
>>
>> I'm also not seeing how what you've done is better then we have now
>> using the root->will_be_snapshotted atomic,
>> which is essentially used the same way as the new atomic you are
>> adding, and forces the writeback code no nocow
>> writes as well.
>
>
>
> So what you have done can be made much more simple by flushing
> delalloc before incrementing root->will_be_snapshotted instead of
> after incrementing it:
>
> https://friendpaste.com/2LY9eLAR9q0RoOtRK7VYmX


 There is no way to solve this problem in this modification.
>>>
>>>
>>> It minimizes it. It only gives better guarantees that nocow buffered
>>> writes that happened before calling the snapshot ioctl will not fall
>>> back to cow,
>>> not for the ones that happen while the call to the ioctl is happening.
>>>

 When writing and create snapshot at the same time, the write will not
 reserve space,
 and will not return to ENOSPC, because will_be_snapshotted is still 0.
 So when writeback flush data, there will still be problems with ENOSPC.
>>>
>>>
>>> Which is precisely what I proposed does without adding a new atomic
>>> and more changes.
>>> It flushes delalloc before incrementing root->will_be_snapshotted, so
>>> that previous buffered nocow writes will not fallback to cow mode (and
>>> require data space allocation).
>>>
>>> It only leaves a very tiny and very unlikely to hit (but not
>>> impossible) time window where nocow writes will fallback
>>> to cow mode - after calling start_delalloc_inodes() and before
>>> incrementing root->will_be_snapshotted a new buffered write can comes
>>> in and gets immediately flushed
>>> because someone called fsync() on the file or the VM decided to
>>> trigger writeback (due to memory pressure or some other reason).
>>>
>>
>> It is very easy to reproduce. Not a tiny time.
>> Because the time of start_delalloc_inodes will be very long.
>> When the dirty page is reduced, the new write can continue
>> to be written, and there is no reserve space.
>
> I see now, thanks.
>>
>> And these dirty pages are not necessarily flushed by
>> start_delalloc_inodes because start_delalloc_inodes is
>> checked from the front to the back, so when the new
>> write is written to the previous position, it will not flush again.
>>
>> So when start_delalloc_inodes is executed, will_be_snapshotted is added,
>> and all remaining dirty pages will be forced to turn to cow, all data is
>> loss.
>> e.g. 16G RAM, dirty ratio 20%,  16*0.2 = 3.2 GB data loss.
>>
>> Writing the inode that has been flushed will have the same problem.
>> The inode that has been flushed will not be flushed for the second time.
>>
>> So you have better suggestions ?
>
> Not efficient ones (introducing more locks).
> This one if fine, but please write a changelog that mentions all these
> details from your replies.
> The original one does not explain in detail the problem nor the solution.
> Some comments before incrementing both atomics and flushing delalloc
> at ioctl.c would also be good to 

Re: [PATCH 4/4] locks: map correct ino/dev pairs when exporting to userspace

2018-08-01 Thread Mark Fasheh
On Wed, Aug 01, 2018 at 08:37:31AM +0300, Amir Goldstein wrote:
> > if (inode) {
> > +   __u64 ino;
> > +   dev_t dev;
> > +
> > +   vfs_map_unique_ino_dev(dentry, , );
> > /* userspace relies on this representation of dev_t */
> > seq_printf(f, "%d %02x:%02x:%ld ", fl_pid,
> > -   MAJOR(inode->i_sb->s_dev),
> > -   MINOR(inode->i_sb->s_dev), inode->i_ino);
> > +   MAJOR(dev), MINOR(dev), inode->i_ino);
> 
> Don't you mean ,ino); ?

Indeed I do, thanks for catching that one Amir.
--Mark
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] vfs: introduce function to map unique ino/dev pairs

2018-08-01 Thread Mark Fasheh
Hi Amir,

On Wed, Aug 01, 2018 at 08:41:14AM +0300, Amir Goldstein wrote:
> > +void vfs_map_unique_ino_dev(struct dentry *dentry, u64 *ino, dev_t *dev)
> 
> I find this function name a bit more than function can guaranty.
> It's just a fancy wrapper around ->getattr()
> How about vfs_get_ino_dev() ?

Yeah I agree with that. An early version actually had the name
vfs_get_ino_dev(), I don't mind going back to it.

Thanks for the review!
--Mark
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/4] btrfs: add helper btrfs_num_devices() to deduce num_devices

2018-08-01 Thread David Sterba
On Thu, Jul 26, 2018 at 02:53:34PM +0800, Anand Jain wrote:
> When the replace is running the fs_devices::num_devices also includes
> the replace device, however in some operations like device delete and
> balance it needs the actual num_devices without the repalce devices, so
> now the function btrfs_num_devices() just provides that.

The part how concurrent balance and dev-replace can be active at the
same time is missing. As this is not obvious, it's desired to have that
in the changelog.

If there are questions and comments in past patchset revisions, that's a
good material for changelogs and when you send an update you can enhance
the changelogs. I do simple fixups or additions but when you send a new
revision it the right time to save time on both sides.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS and databases

2018-08-01 Thread Remi Gauvin
On 2018-07-31 11:45 PM, MegaBrutal wrote:

> I know that with nodatacow, I take away most of the benefits of BTRFS
> (those are actually hurting database performance – the exact CoW
> nature that is elsewhere a blessing, with databases it's a drawback).
> But are there any advantages of still sticking to BTRFS for a database
> albeit CoW is disabled, or should I just return to the old and
> reliable ext4 for those applications?
> 

Be very careful about nodatacow and btrfs 'raid'.  BTRFS has no data
synching mechanism for raid, so if your mirrors end up different
somehow, your Array is going to be inconsistent.
<>

Re: [PATCH v2 2/4] btrfs: fix race between free_stale_devices and close_fs_devices

2018-08-01 Thread David Sterba
On Thu, Jul 26, 2018 at 02:53:32PM +0800, Anand Jain wrote:
> From: Anand Jain 
> 
> %fs_devices can be free-ed by btrfs_free_stale_devices() when the
> close_fs_devices() drops fs_devices::opened to zero, but close_fs_devices
> tries to access the %fs_devices again without the device_list_mutex.
> 
> Fix this by bringing the %fs_devices access with in the device_list_mutex.

AFAICS this cannot happen anymore because the two calls are serialized
by the uuid_mutex. But this was not the case when syzbot reported the
problem where your patch would apply.

The parallell access to opened and device list cannot happen when:

* btrfs_scan_one_device that wants to call btrfs_free_stale_devices
* btrfs_close_devices calls close_fs_devices

Fixed by the series:

btrfs: lift uuid_mutex to callers of btrfs_scan_one_device
btrfs: lift uuid_mutex to callers of btrfs_open_devices
btrfs: lift uuid_mutex to callers of btrfs_parse_early_options
btrfs: reorder initialization before the mount locks uuid_mutex
btrfs: fix mount and ioctl device scan ioctl race

If there's a race I don't see, please describe in more detail.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: locking: Allow btrfs_tree_lock() to return error to avoid deadlock

2018-08-01 Thread Qu Wenruo


On 2018年08月01日 21:27, David Sterba wrote:
> On Tue, Jul 31, 2018 at 02:52:23PM +0800, Qu Wenruo wrote:
>> [BUG]
>> For certains crafted image, whose csum root leaf has missing backref, if
>> we try to trigger write with data csum, it could cause deadlock with the
>> following kernel WARN_ON():
>> --
>> WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400
>> CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
>> Workqueue: btrfs-endio-write btrfs_endio_write_helper
>> RIP: 0010:btrfs_tree_lock+0x3e2/0x400
>> Call Trace:
>>  btrfs_alloc_tree_block+0x39f/0x770
>>  __btrfs_cow_block+0x285/0x9e0
>>  btrfs_cow_block+0x191/0x2e0
>>  btrfs_search_slot+0x492/0x1160
>>  btrfs_lookup_csum+0xec/0x280
>>  btrfs_csum_file_blocks+0x2be/0xa60
>>  add_pending_csums+0xaf/0xf0
>>  btrfs_finish_ordered_io+0x74b/0xc90
>>  finish_ordered_fn+0x15/0x20
>>  normal_work_helper+0xf6/0x500
>>  btrfs_endio_write_helper+0x12/0x20
>>  process_one_work+0x302/0x770
>>  worker_thread+0x81/0x6d0
>>  kthread+0x180/0x1d0
>>  ret_from_fork+0x35/0x40
>> ---[ end trace 2e85051acb5f6dc1 ]---
>> --
>>
>> [CAUSE]
>> That crafted image has missing backref for csum tree root leaf.
>> And when we try to allocate new tree block, since there is no
>> EXTENT/METADATA_ITEM for csum tree root, btrfs consider it's free slot
>> and use it.
> 
> I don't understand what exactly is the problem, can you please rephrase
> that? Ie. what exactly is missing from where. If the problem is on the
> higher level, regarding the csum tree, this could be checked much
> earlier. I don't really like to see each and every callsite of tree
> locks to require the error handling, so I'm exploring the possibilities
> if this is necessary at all.

This is a little complex.

I'll try to explain this using some asciiart:

Normal image  |   This fuzzed image
--+
BG 29360128   | BG 29360128
 One empty slot   |  One empty slot
29364224: backref to UUID tree| 29364224: backref to UUID tree
 Two empty slots  |  Two empty slots
29376512: backref to CSUM tree|  One empty slot (bad type)
29380608: backref to D_RELOC tree | 29380608: backref to D_RELOC tree
...   | ...

Since bytenr 29376512 has no METADATA/EXTENT_ITEM, when btrfs try to
alloc tree block, it's an valid slot for btrfs.

And for finish_ordered_write, when we need to insert csum, we try to CoW
csum tree root.

Then extent allocator gives a new extent bytenr 29376512 length 4096.
However bytenr 29376512 is our current csum root, and already locked
during CoW.

So in btrfs_alloc_tree_block() which calls btrfs_tree_lock() to lock the
newly allocated tree block, we are trying to lock the tree block we have
already locked, thus triggering the WARN_ON() and deadlock.


In this particular fuzzed image, we can do key type check to detect
early corruption, but it has the following problem:

1) Won't really solve the problem
   I could easily craft a image with backref for csum tree missing, and
   everything else is OK. It will still cause the problem.

2) May break later RO_compat features
   If later RO_compat feature introduces new key type in extent tree, in
   theory it should be able to be mounted RO, but will be rejected by
   the type check.

So we have to go the way used in the patch.

Another solution would be do backref check each time we try to read out
a tree block, but that's too costly and may screw up locking.

Thanks,
Qu

> 
>> However we have already locked csum tree root leaf by ourselves, thus we
>> will ourselves to release read/write lock, and causing deadlock.
>>
>> [FIX]
>> This patch will allow btrfs_tree_lock() to return error number, so
>> most callers can exit gracefully.
>> (Some caller doesn't really expect btrfs_tree_lock() to return error,
>> and in that case, we use BUG_ON())
>>
>> Since such problem can only happen in crafted image, we will still
>> trigger kernel warning, but with a little more meaningful warning
>> message.
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405
>> Reported-by: Xu Wen 
>> Signed-off-by: Qu Wenruo 
>> Reviewed-by: Su Yue 
>> ---
>> changelog:
>> v2:
>>   Fix pid_t output format from %llu to %d. Pointed by Su.
>>   Fix missing return 0 for btrfs_lock_tree().
>>   Update reviewed-by tags.
>> ---
>>  fs/btrfs/ctree.c   | 57 +++---
>>  fs/btrfs/extent-tree.c | 28 +++
>>  fs/btrfs/extent_io.c   |  8 --
>>  fs/btrfs/free-space-tree.c |  4 ++-
>>  fs/btrfs/locking.c | 13 +++--
>>  fs/btrfs/locking.h |  2 +-
>>  fs/btrfs/qgroup.c  |  4 ++-
>>  fs/btrfs/relocation.c  | 13 +++--
>>  fs/btrfs/tree-log.c| 14 --
>>  9 files changed, 115 insertions(+), 28 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
>> index 

Re: [PATCH v2] btrfs: locking: Allow btrfs_tree_lock() to return error to avoid deadlock

2018-08-01 Thread David Sterba
On Tue, Jul 31, 2018 at 02:52:23PM +0800, Qu Wenruo wrote:
> [BUG]
> For certains crafted image, whose csum root leaf has missing backref, if
> we try to trigger write with data csum, it could cause deadlock with the
> following kernel WARN_ON():
> --
> WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400
> CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
> Workqueue: btrfs-endio-write btrfs_endio_write_helper
> RIP: 0010:btrfs_tree_lock+0x3e2/0x400
> Call Trace:
>  btrfs_alloc_tree_block+0x39f/0x770
>  __btrfs_cow_block+0x285/0x9e0
>  btrfs_cow_block+0x191/0x2e0
>  btrfs_search_slot+0x492/0x1160
>  btrfs_lookup_csum+0xec/0x280
>  btrfs_csum_file_blocks+0x2be/0xa60
>  add_pending_csums+0xaf/0xf0
>  btrfs_finish_ordered_io+0x74b/0xc90
>  finish_ordered_fn+0x15/0x20
>  normal_work_helper+0xf6/0x500
>  btrfs_endio_write_helper+0x12/0x20
>  process_one_work+0x302/0x770
>  worker_thread+0x81/0x6d0
>  kthread+0x180/0x1d0
>  ret_from_fork+0x35/0x40
> ---[ end trace 2e85051acb5f6dc1 ]---
> --
> 
> [CAUSE]
> That crafted image has missing backref for csum tree root leaf.
> And when we try to allocate new tree block, since there is no
> EXTENT/METADATA_ITEM for csum tree root, btrfs consider it's free slot
> and use it.

I don't understand what exactly is the problem, can you please rephrase
that? Ie. what exactly is missing from where. If the problem is on the
higher level, regarding the csum tree, this could be checked much
earlier. I don't really like to see each and every callsite of tree
locks to require the error handling, so I'm exploring the possibilities
if this is necessary at all.

> However we have already locked csum tree root leaf by ourselves, thus we
> will ourselves to release read/write lock, and causing deadlock.
> 
> [FIX]
> This patch will allow btrfs_tree_lock() to return error number, so
> most callers can exit gracefully.
> (Some caller doesn't really expect btrfs_tree_lock() to return error,
> and in that case, we use BUG_ON())
> 
> Since such problem can only happen in crafted image, we will still
> trigger kernel warning, but with a little more meaningful warning
> message.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405
> Reported-by: Xu Wen 
> Signed-off-by: Qu Wenruo 
> Reviewed-by: Su Yue 
> ---
> changelog:
> v2:
>   Fix pid_t output format from %llu to %d. Pointed by Su.
>   Fix missing return 0 for btrfs_lock_tree().
>   Update reviewed-by tags.
> ---
>  fs/btrfs/ctree.c   | 57 +++---
>  fs/btrfs/extent-tree.c | 28 +++
>  fs/btrfs/extent_io.c   |  8 --
>  fs/btrfs/free-space-tree.c |  4 ++-
>  fs/btrfs/locking.c | 13 +++--
>  fs/btrfs/locking.h |  2 +-
>  fs/btrfs/qgroup.c  |  4 ++-
>  fs/btrfs/relocation.c  | 13 +++--
>  fs/btrfs/tree-log.c| 14 --
>  9 files changed, 115 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 4bc326df472e..6e695f4cd931 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -161,10 +161,12 @@ struct extent_buffer *btrfs_root_node(struct btrfs_root 
> *root)
>  struct extent_buffer *btrfs_lock_root_node(struct btrfs_root *root)
>  {
>   struct extent_buffer *eb;
> + int ret;
>  
>   while (1) {
>   eb = btrfs_root_node(root);
> - btrfs_tree_lock(eb);
> + ret = btrfs_tree_lock(eb);
> + BUG_ON(ret < 0);
>   if (eb == root->node)
>   break;
>   btrfs_tree_unlock(eb);
> @@ -1584,6 +1586,7 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans,
>   for (i = start_slot; i <= end_slot; i++) {
>   struct btrfs_key first_key;
>   int close = 1;
> + int ret;
>  
>   btrfs_node_key(parent, _key, i);
>   if (!progress_passed && comp_keys(_key, progress) < 0)
> @@ -1637,7 +1640,11 @@ int btrfs_realloc_node(struct btrfs_trans_handle 
> *trans,
>   if (search_start == 0)
>   search_start = last_block;
>  
> - btrfs_tree_lock(cur);
> + ret = btrfs_tree_lock(cur);
> + if (ret < 0) {
> + free_extent_buffer(cur);
> + return ret;
> + }
>   btrfs_set_lock_blocking(cur);
>   err = __btrfs_cow_block(trans, root, cur, parent, i,
>   , search_start,
> @@ -1853,7 +1860,11 @@ static noinline int balance_level(struct 
> btrfs_trans_handle *trans,
>   goto enospc;
>   }
>  
> - btrfs_tree_lock(child);
> + ret = btrfs_tree_lock(child);
> + if (ret < 0) {
> + free_extent_buffer(child);
> + goto enospc;
> + }
>   btrfs_set_lock_blocking(child);
>   ret = 

Re: [PATCH] btrfs: replace: Reset on-disk dev stats value after replace

2018-08-01 Thread David Sterba
On Tue, Jul 31, 2018 at 04:20:21PM +0900, Misono Tomohiro wrote:
> on-disk devs stats value is updated in btrfs_run_dev_stats(),
> which is called during commit transaction, if device->dev_stats_ccnt
> is not zero.
> 
> Since current replace operation does not touch dev_stats_ccnt,
> on-disk dev stats value is not updated. Therefore "btrfs device stats"
> may return old device's value after umount/mount
> (Example: See "btrfs ins dump-t -t DEV $DEV" after btrfs/100 finish).

In such cases it's good to have a snippet of the output, so it's
possible to match the output and check if we see the same thing.

Otherwise looks ok, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH resend 1/2] btrfs: allow defrag on a file opened ro that has rw permissions

2018-08-01 Thread David Sterba
On Tue, Jul 24, 2018 at 10:35:28PM +0200, Adam Borowski wrote:
> On Mon, Jul 23, 2018 at 05:26:24PM +0200, David Sterba wrote:
> > On Wed, Jul 18, 2018 at 12:08:59AM +0200, Adam Borowski wrote:
> (Combined with as-folded)
> 
> | | btrfs: allow defrag on a file opened read-only that has rw permissions
> | |
> > > Requiring a rw descriptor conflicts both ways with exec, returning ETXTBSY
> > > whenever you try to defrag a program that's currently being run, or
> > > causing intermittent exec failures on a live system being defragged.
> > > 
> > > As defrag doesn't change the file's contents in any way, there's no reason
> > > to consider it a rw operation.  Thus, let's check only whether the file
> > > could have been opened rw.  Such access control is still needed as
> > > currently defrag can use extra disk space, and might trigger bugs.
> <-
> | | We give EINVAL when the request is invalid; here it's ok but merely the
> | | user has insufficient privileges.  Thus, this return value reflects the
> | | error better -- as discussed in the identical case for dedupe.
> | |
> | | According to codesearch.debian.net, no userspace program distinguishes
> | | these values beyond strerror().
> | |
> | | Signed-off-by: Adam Borowski 
> | | Reviewed-by: David Sterba 
> | | [ fold the EPERM patch from Adam ]
> | | Signed-off-by: David Sterba 
> 
> [...]
> > So, I'll add the patch to 4.19 queue. It's small and isolated change so
> > a revert would be easy in case we find something bad. The 2nd patch
> > should be IMHO part of this change as it's logical to return the error
> > code in the patch that modifies the user visible behaviour.
> 
> A nitpick: the new commit message has a dangling pointer "this" to the title
> of the commit that was squashed.  It was:
> 
> | btrfs: defrag: return EPERM not EINVAL when only permissions fail
> 
> It'd be nice if it could be inserted in some form in the place I marked with
> an arrow.

Right. I've replaced 'this' with 'EPERM', thanks for catching it.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: extent-tree.c: Remove unused __btrfs_free_block_rsv()

2018-08-01 Thread David Sterba
On Thu, Jul 26, 2018 at 11:40:54AM +0900, Misono Tomohiro wrote:
> There is no user of this function.
> 
> This is forgotten to get removed in commit a575ceeb1338
> ("Btrfs: get rid of unused orphan infrastructure").
> 
> Signed-off-by: Misono Tomohiro 

Added to misc-next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: backref.c: Use ERR_CAST() to return error code

2018-08-01 Thread David Sterba
On Thu, Jul 26, 2018 at 10:22:58AM +0900, Misono Tomohiro wrote:
> Use ERR_CAST() instead of void * to make meaning clear.
> 
> Signed-off-by: Misono Tomohiro 

Added to misc-next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix data lose with snapshot when nospace

2018-08-01 Thread Filipe Manana
On Wed, Aug 1, 2018 at 11:20 AM, robbieko  wrote:
> Filipe Manana 於 2018-07-31 19:33 寫到:
>
>> On Tue, Jul 31, 2018 at 11:17 AM, robbieko  wrote:
>>>
>>> Filipe Manana 於 2018-07-30 20:34 寫到:
>>>
 On Mon, Jul 30, 2018 at 12:28 PM, Filipe Manana 
 wrote:
>
>
> On Mon, Jul 30, 2018 at 12:08 PM, Filipe Manana 
> wrote:
>>
>>
>> On Mon, Jul 30, 2018 at 11:21 AM, robbieko 
>> wrote:
>>>
>>>
>>> From: Robbie Ko 
>>>
>>> Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
>>> modified the nocow writeback mechanism, if you create a snapshot,
>>> it will always switch to cow writeback.
>>>
>>> This will cause data loss when there is no space, because
>>> when the space is full, the write will not reserve any space, only
>>> check if it can be nocow write.
>>
>>
>>
>> This is a bit vague.
>> You need to mention where space reservation does not happen (at the
>> time of the write syscall) and why,
>> and that the snapshot happens before flushing IO (running dealloc).
>> Then when running dealloc we fallback
>> to COW and fail.
>>
>> You also need to tell that although the write syscall did not return
>> an error, the writeback will
>> fail but a subsequent fsync on the file will return an error (ENOSPC)
>> because the writeback set the error
>> on the inode's mapping, so it's not completely a silent data loss, as
>> for buffered writes there's no guarantee
>> that if write syscall returns 0 the data will be persisted
>> successfully (that can only be guaranteed if a subsequent
>> fsync call returns 0).
>>
>>>
>>> So fix this by first flush the nocow data, and then switch to the
>>> cow write.
>
>
>
> I'm also not seeing how what you've done is better then we have now
> using the root->will_be_snapshotted atomic,
> which is essentially used the same way as the new atomic you are
> adding, and forces the writeback code no nocow
> writes as well.



 So what you have done can be made much more simple by flushing
 delalloc before incrementing root->will_be_snapshotted instead of
 after incrementing it:

 https://friendpaste.com/2LY9eLAR9q0RoOtRK7VYmX
>>>
>>>
>>> There is no way to solve this problem in this modification.
>>
>>
>> It minimizes it. It only gives better guarantees that nocow buffered
>> writes that happened before calling the snapshot ioctl will not fall
>> back to cow,
>> not for the ones that happen while the call to the ioctl is happening.
>>
>>>
>>> When writing and create snapshot at the same time, the write will not
>>> reserve space,
>>> and will not return to ENOSPC, because will_be_snapshotted is still 0.
>>> So when writeback flush data, there will still be problems with ENOSPC.
>>
>>
>> Which is precisely what I proposed does without adding a new atomic
>> and more changes.
>> It flushes delalloc before incrementing root->will_be_snapshotted, so
>> that previous buffered nocow writes will not fallback to cow mode (and
>> require data space allocation).
>>
>> It only leaves a very tiny and very unlikely to hit (but not
>> impossible) time window where nocow writes will fallback
>> to cow mode - after calling start_delalloc_inodes() and before
>> incrementing root->will_be_snapshotted a new buffered write can comes
>> in and gets immediately flushed
>> because someone called fsync() on the file or the VM decided to
>> trigger writeback (due to memory pressure or some other reason).
>>
>
> It is very easy to reproduce. Not a tiny time.
> Because the time of start_delalloc_inodes will be very long.
> When the dirty page is reduced, the new write can continue
> to be written, and there is no reserve space.

I see now, thanks.
>
> And these dirty pages are not necessarily flushed by
> start_delalloc_inodes because start_delalloc_inodes is
> checked from the front to the back, so when the new
> write is written to the previous position, it will not flush again.
>
> So when start_delalloc_inodes is executed, will_be_snapshotted is added,
> and all remaining dirty pages will be forced to turn to cow, all data is
> loss.
> e.g. 16G RAM, dirty ratio 20%,  16*0.2 = 3.2 GB data loss.
>
> Writing the inode that has been flushed will have the same problem.
> The inode that has been flushed will not be flushed for the second time.
>
> So you have better suggestions ?

Not efficient ones (introducing more locks).
This one if fine, but please write a changelog that mentions all these
details from your replies.
The original one does not explain in detail the problem nor the solution.
Some comments before incrementing both atomics and flushing delalloc
at ioctl.c would also be good to have.

And a test case for fstests too.

Thanks.

>
>
>>>
>>> The behavior I changed was to increase will_be_snapshotted first,
>>> so the following write must have a reserve space,
>>> 

Re: [PATCH 0/7] Some unused parameter cleanup

2018-08-01 Thread David Sterba
On Wed, Aug 01, 2018 at 11:32:24AM +0800, Lu Fengqi wrote:
> PATCH 1-2 remove all unused fs_info parameter from delayed-inode.c
> PATCH 3-5 remove all unused fs_info parameter from root-tree.c
> PATCH 6-7 some cleanup for btrfs_unlink_subvol

Thanks, added to misc-next.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] btrfs: Handle owner mismatch gracefully when walking up tree

2018-08-01 Thread Qu Wenruo



On 2018年08月01日 20:12, Nikolay Borisov wrote:
> 
> 
> On  1.08.2018 14:13, Qu Wenruo wrote:
>>
>>
>> On 2018年08月01日 18:08, Nikolay Borisov wrote:
>>>
>>>
>>> On  1.08.2018 11:08, Qu Wenruo wrote:
 [BUG]
 When mounting certain crafted image, btrfs will trigger kernel BUG_ON()
 when try to recover balance:
 --
 [ cut here ]
 kernel BUG at fs/btrfs/extent-tree.c:8956!
 invalid opcode:  [#1] PREEMPT SMP NOPTI
 CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
 RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
 RSP: 0018:b53540c9b890 EFLAGS: 00010202
 Call Trace:
  walk_up_tree+0x172/0x1f0 [btrfs]
  btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
  merge_reloc_roots+0xe1/0x1d0 [btrfs]
  btrfs_recover_relocation+0x3ea/0x420 [btrfs]
  open_ctree+0x1af3/0x1dd0 [btrfs]
  btrfs_mount_root+0x66b/0x740 [btrfs]
  mount_fs+0x3b/0x16a
  vfs_kern_mount.part.9+0x54/0x140
  btrfs_mount+0x16d/0x890 [btrfs]
  mount_fs+0x3b/0x16a
  vfs_kern_mount.part.9+0x54/0x140
  do_mount+0x1fd/0xda0
  ksys_mount+0xba/0xd0
  __x64_sys_mount+0x21/0x30
  do_syscall_64+0x60/0x210
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 ---[ end trace d4344e4deee03435 ]---
 --

 [CAUSE]
 Another extent tree corruption.

 In this particular case, tree reloc root's owner is
 DATA_RELOC_TREE (should be TREE_RELOC_TREE), thus its backref is
 corrupted and we failed the owner check in walk_up_tree().

 [FIX]
 It's pretty hard to take care of every extent tree corruption, but at
 least we can remove such BUG_ON() and exit more gracefully.

 And since in this particular image, DATA_RELOC_TREE and TREE_RELOC_TREE
 shares the same root (which is obviously invalid), we needs to make
 __del_reloc_root() more robust to detect such invalid share to avoid
 possible NULL dereference as root->node can be NULL in this case.

 Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411
 Reported-by: Xu Wen 
 Signed-off-by: Qu Wenruo 
 ---
 As always, the patch is also pushed to my github repo, along with other
 fuzzed images related fixes:
 https://github.com/adam900710/linux/tree/tree_checker_enhance
 (BTW, is it correct to indicate a branch like above?)
 ---
  fs/btrfs/extent-tree.c | 27 +++
  fs/btrfs/relocation.c  |  2 +-
  2 files changed, 20 insertions(+), 9 deletions(-)

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index da615ebc072e..5f4ca61348b5 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -8949,17 +8949,26 @@ static noinline int walk_up_proc(struct 
 btrfs_trans_handle *trans,
}
  
if (eb == root->node) {
 -  if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF)
 +  if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
parent = eb->start;
 -  else
 -  BUG_ON(root->root_key.objectid !=
 - btrfs_header_owner(eb));
 +  } else if (root->root_key.objectid != btrfs_header_owner(eb)) {
 +  btrfs_err_rl(fs_info,
 +  "unexpected tree owner, have %llu expect %llu",
 +   btrfs_header_owner(eb),
 +   root->root_key.objectid);
 +  return -EINVAL;
>>>
>>> EINVAL or ECLEANUP?
>>
>> Yep, also my concern here.
>>
>> I have no bias here, and both makes its sense here.
>>
>> EUCLEAN means it's something unexpected, but normally it's used in
>> static check, no sure if it suits for runtime check.
> 
> My thinking goes if something is an on-disk error (and fuzzed images
> fall in that category) then we should return EUCLEAN. If the owner can
> be mismatched only as a result of erroneous data on-disk which is then
> read and subsequently this code triggers then it's something induced due
> to an on-disk error.

Makes sense.

Does it also mean later BUG_ON() convert would also use EUCLEAN as most
BUG_ON() is either some real bug or corrupted/fuzzed images?

Thanks,
Qu

> 
>>
>> Although EINVAL looks more suitable for runtime error, it is not a
>> perfect errno either, as it's not something invalid from user, but the
>> fs has something unexpected.
>>
>> I'm all ears on this errno issue.
>>
>> Thanks,
>> Qu
>>
>>>
 +  }
} else {
 -  if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF)
 +  if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
parent = path->nodes[level + 1]->start;
 -  else
 -  BUG_ON(root->root_key.objectid !=
 - btrfs_header_owner(path->nodes[level + 1]));
 +  } else if (root->root_key.objectid !=
 

Re: [PATCH 1/1] btrfs: Handle owner mismatch gracefully when walking up tree

2018-08-01 Thread Nikolay Borisov



On  1.08.2018 14:13, Qu Wenruo wrote:
> 
> 
> On 2018年08月01日 18:08, Nikolay Borisov wrote:
>>
>>
>> On  1.08.2018 11:08, Qu Wenruo wrote:
>>> [BUG]
>>> When mounting certain crafted image, btrfs will trigger kernel BUG_ON()
>>> when try to recover balance:
>>> --
>>> [ cut here ]
>>> kernel BUG at fs/btrfs/extent-tree.c:8956!
>>> invalid opcode:  [#1] PREEMPT SMP NOPTI
>>> CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
>>> RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
>>> RSP: 0018:b53540c9b890 EFLAGS: 00010202
>>> Call Trace:
>>>  walk_up_tree+0x172/0x1f0 [btrfs]
>>>  btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
>>>  merge_reloc_roots+0xe1/0x1d0 [btrfs]
>>>  btrfs_recover_relocation+0x3ea/0x420 [btrfs]
>>>  open_ctree+0x1af3/0x1dd0 [btrfs]
>>>  btrfs_mount_root+0x66b/0x740 [btrfs]
>>>  mount_fs+0x3b/0x16a
>>>  vfs_kern_mount.part.9+0x54/0x140
>>>  btrfs_mount+0x16d/0x890 [btrfs]
>>>  mount_fs+0x3b/0x16a
>>>  vfs_kern_mount.part.9+0x54/0x140
>>>  do_mount+0x1fd/0xda0
>>>  ksys_mount+0xba/0xd0
>>>  __x64_sys_mount+0x21/0x30
>>>  do_syscall_64+0x60/0x210
>>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> ---[ end trace d4344e4deee03435 ]---
>>> --
>>>
>>> [CAUSE]
>>> Another extent tree corruption.
>>>
>>> In this particular case, tree reloc root's owner is
>>> DATA_RELOC_TREE (should be TREE_RELOC_TREE), thus its backref is
>>> corrupted and we failed the owner check in walk_up_tree().
>>>
>>> [FIX]
>>> It's pretty hard to take care of every extent tree corruption, but at
>>> least we can remove such BUG_ON() and exit more gracefully.
>>>
>>> And since in this particular image, DATA_RELOC_TREE and TREE_RELOC_TREE
>>> shares the same root (which is obviously invalid), we needs to make
>>> __del_reloc_root() more robust to detect such invalid share to avoid
>>> possible NULL dereference as root->node can be NULL in this case.
>>>
>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411
>>> Reported-by: Xu Wen 
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>> As always, the patch is also pushed to my github repo, along with other
>>> fuzzed images related fixes:
>>> https://github.com/adam900710/linux/tree/tree_checker_enhance
>>> (BTW, is it correct to indicate a branch like above?)
>>> ---
>>>  fs/btrfs/extent-tree.c | 27 +++
>>>  fs/btrfs/relocation.c  |  2 +-
>>>  2 files changed, 20 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index da615ebc072e..5f4ca61348b5 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -8949,17 +8949,26 @@ static noinline int walk_up_proc(struct 
>>> btrfs_trans_handle *trans,
>>> }
>>>  
>>> if (eb == root->node) {
>>> -   if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF)
>>> +   if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
>>> parent = eb->start;
>>> -   else
>>> -   BUG_ON(root->root_key.objectid !=
>>> -  btrfs_header_owner(eb));
>>> +   } else if (root->root_key.objectid != btrfs_header_owner(eb)) {
>>> +   btrfs_err_rl(fs_info,
>>> +   "unexpected tree owner, have %llu expect %llu",
>>> +btrfs_header_owner(eb),
>>> +root->root_key.objectid);
>>> +   return -EINVAL;
>>
>> EINVAL or ECLEANUP?
> 
> Yep, also my concern here.
> 
> I have no bias here, and both makes its sense here.
> 
> EUCLEAN means it's something unexpected, but normally it's used in
> static check, no sure if it suits for runtime check.

My thinking goes if something is an on-disk error (and fuzzed images
fall in that category) then we should return EUCLEAN. If the owner can
be mismatched only as a result of erroneous data on-disk which is then
read and subsequently this code triggers then it's something induced due
to an on-disk error.

> 
> Although EINVAL looks more suitable for runtime error, it is not a
> perfect errno either, as it's not something invalid from user, but the
> fs has something unexpected.
> 
> I'm all ears on this errno issue.
> 
> Thanks,
> Qu
> 
>>
>>> +   }
>>> } else {
>>> -   if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF)
>>> +   if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
>>> parent = path->nodes[level + 1]->start;
>>> -   else
>>> -   BUG_ON(root->root_key.objectid !=
>>> -  btrfs_header_owner(path->nodes[level + 1]));
>>> +   } else if (root->root_key.objectid !=
>>> +  btrfs_header_owner(path->nodes[level + 1])) {
>>> +   btrfs_err_rl(fs_info,
>>> +   "unexpected tree owner, have %llu expect %llu",
>>> +btrfs_header_owner(eb),
>>> +root->root_key.objectid);

Re: Unmountable root partition

2018-08-01 Thread Cerem Cem ASLAN
Yes, command output is as is, because I just copied and pasted into the
mail. When I omit the `-t btrfs` part, result is the same.

I'm now trying to rescue what I can, so getting a image dump with
`ddrescue`. It's read about 25% without any errors but it will be expected
to finish in 6 hours. Then I'll try btrfscue to see what happens.

I know it's totally my fault because I must be ready for a total
disk/pc/building burn out. Lessons learned.

2018-08-01 8:32 GMT+03:00 Chris Murphy :

> On Tue, Jul 31, 2018 at 12:03 PM, Cerem Cem ASLAN 
> wrote:
>
> > 3. mount -t btrfs /dev/mapper/foo--vg-root /mnt/foo
> > Gives the following error:
> >
> > mount: wrong fs type, bad option, bad superblock on ...
> >
> > 4. dmesg | tail
> > Outputs the following:
> >
> >
> > [17755.840916] sd 3:0:0:0: [sda] tag#0 FAILED Result:
> > hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> > [17755.840919] sd 3:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 00 07 c0 02
> 00 00
> > 02 00
> > [17755.840921] blk_update_request: I/O error, dev sda, sector 507906
> > [17755.840941] EXT4-fs (dm-4): unable to read superblock
>
>
> Are you sure this is the output for the command? Because you're
> explicitly asking for type btrfs, which fails, and then the kernel
> reports EXT4 superblock unreadable. What do you get if you omit -t
> btrfs and just let it autodetect?
>
> But yeah, this is an IO error from the device and there's nothing
> Btrfs can do about that unless there is DUP or raid1+ metadata
> available.
>
> Is it possible this LV was accidentally reformatted ext4?
>
>
> --
> Chris Murphy
>


Re: BTRFS and databases

2018-08-01 Thread Adam Borowski
On Wed, Aug 01, 2018 at 05:45:15AM +0200, MegaBrutal wrote:
> But there is still one question that I can't get over: if you store a
> database (e.g. MySQL), would you prefer having a BTRFS volume mounted
> with nodatacow, or would you just simply use ext4?
> 
> I know that with nodatacow, I take away most of the benefits of BTRFS
> (those are actually hurting database performance – the exact CoW
> nature that is elsewhere a blessing, with databases it's a drawback).
> But are there any advantages of still sticking to BTRFS for a database
> albeit CoW is disabled, or should I just return to the old and
> reliable ext4 for those applications?

Is this database performance-critical?

If yes, you'd want ext4 -- nocow is a crappy ext4 lookalike, with no
benefits of btrfs.  Or, if you snapshot it, you get bad fragmentation yet no
checksums/etc.

If no, regular cow (especially with autodefrag) will be enough.  Sure, this
particular load won't be as performant (mysql really loves fsync, which is
an anathema to btrfs), but you get all the data safety improvements,
frequent cheap backups, and so on.

Thus: if the server's primary purpose is that database, you don't want
btrfs.  If the database is merely incidental, not microoptimizing it will
save a lot of your time.

In neither case nocow is a good idea.  Especially if raid (!= 0) is
involved.


Meow!
-- 
// If you believe in so-called "intellectual property", please immediately
// cease using counterfeit alphabets.  Instead, contact the nearest temple
// of Amon, whose priests will provide you with scribal services for all
// your writing needs, for Reasonable And Non-Discriminatory prices.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] btrfs: Handle owner mismatch gracefully when walking up tree

2018-08-01 Thread Qu Wenruo



On 2018年08月01日 18:08, Nikolay Borisov wrote:
> 
> 
> On  1.08.2018 11:08, Qu Wenruo wrote:
>> [BUG]
>> When mounting certain crafted image, btrfs will trigger kernel BUG_ON()
>> when try to recover balance:
>> --
>> [ cut here ]
>> kernel BUG at fs/btrfs/extent-tree.c:8956!
>> invalid opcode:  [#1] PREEMPT SMP NOPTI
>> CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
>> RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
>> RSP: 0018:b53540c9b890 EFLAGS: 00010202
>> Call Trace:
>>  walk_up_tree+0x172/0x1f0 [btrfs]
>>  btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
>>  merge_reloc_roots+0xe1/0x1d0 [btrfs]
>>  btrfs_recover_relocation+0x3ea/0x420 [btrfs]
>>  open_ctree+0x1af3/0x1dd0 [btrfs]
>>  btrfs_mount_root+0x66b/0x740 [btrfs]
>>  mount_fs+0x3b/0x16a
>>  vfs_kern_mount.part.9+0x54/0x140
>>  btrfs_mount+0x16d/0x890 [btrfs]
>>  mount_fs+0x3b/0x16a
>>  vfs_kern_mount.part.9+0x54/0x140
>>  do_mount+0x1fd/0xda0
>>  ksys_mount+0xba/0xd0
>>  __x64_sys_mount+0x21/0x30
>>  do_syscall_64+0x60/0x210
>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> ---[ end trace d4344e4deee03435 ]---
>> --
>>
>> [CAUSE]
>> Another extent tree corruption.
>>
>> In this particular case, tree reloc root's owner is
>> DATA_RELOC_TREE (should be TREE_RELOC_TREE), thus its backref is
>> corrupted and we failed the owner check in walk_up_tree().
>>
>> [FIX]
>> It's pretty hard to take care of every extent tree corruption, but at
>> least we can remove such BUG_ON() and exit more gracefully.
>>
>> And since in this particular image, DATA_RELOC_TREE and TREE_RELOC_TREE
>> shares the same root (which is obviously invalid), we needs to make
>> __del_reloc_root() more robust to detect such invalid share to avoid
>> possible NULL dereference as root->node can be NULL in this case.
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411
>> Reported-by: Xu Wen 
>> Signed-off-by: Qu Wenruo 
>> ---
>> As always, the patch is also pushed to my github repo, along with other
>> fuzzed images related fixes:
>> https://github.com/adam900710/linux/tree/tree_checker_enhance
>> (BTW, is it correct to indicate a branch like above?)
>> ---
>>  fs/btrfs/extent-tree.c | 27 +++
>>  fs/btrfs/relocation.c  |  2 +-
>>  2 files changed, 20 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index da615ebc072e..5f4ca61348b5 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -8949,17 +8949,26 @@ static noinline int walk_up_proc(struct 
>> btrfs_trans_handle *trans,
>>  }
>>  
>>  if (eb == root->node) {
>> -if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF)
>> +if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
>>  parent = eb->start;
>> -else
>> -BUG_ON(root->root_key.objectid !=
>> -   btrfs_header_owner(eb));
>> +} else if (root->root_key.objectid != btrfs_header_owner(eb)) {
>> +btrfs_err_rl(fs_info,
>> +"unexpected tree owner, have %llu expect %llu",
>> + btrfs_header_owner(eb),
>> + root->root_key.objectid);
>> +return -EINVAL;
> 
> EINVAL or ECLEANUP?

Yep, also my concern here.

I have no bias here, and both makes its sense here.

EUCLEAN means it's something unexpected, but normally it's used in
static check, no sure if it suits for runtime check.

Although EINVAL looks more suitable for runtime error, it is not a
perfect errno either, as it's not something invalid from user, but the
fs has something unexpected.

I'm all ears on this errno issue.

Thanks,
Qu

> 
>> +}
>>  } else {
>> -if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF)
>> +if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
>>  parent = path->nodes[level + 1]->start;
>> -else
>> -BUG_ON(root->root_key.objectid !=
>> -   btrfs_header_owner(path->nodes[level + 1]));
>> +} else if (root->root_key.objectid !=
>> +   btrfs_header_owner(path->nodes[level + 1])) {
>> +btrfs_err_rl(fs_info,
>> +"unexpected tree owner, have %llu expect %llu",
>> + btrfs_header_owner(eb),
>> + root->root_key.objectid);
>> +return -EINVAL;
> ditto
>> +}
>>  }
>>  
>>  btrfs_free_tree_block(trans, root, eb, parent, wc->refs[level] == 1);
>> @@ -9020,6 +9029,8 @@ static noinline int walk_up_tree(struct 
>> btrfs_trans_handle *trans,
>>  ret = walk_up_proc(trans, root, path, wc);
>>  if (ret > 0)
>>  return 0;
>> +if (ret < 0)
>> +   

Re: [PATCH 4/4] locks: map correct ino/dev pairs when exporting to userspace

2018-08-01 Thread Jeff Layton
On Tue, 2018-07-31 at 14:10 -0700, Mark Fasheh wrote:
> /proc/locks does not always print the correct inode number/device pair.
> Update lock_get_status() to use vfs_map_unique_ino_dev() to get the real,
> unique values for userspace.
> 
> Signed-off-by: Mark Fasheh 
> ---
>  fs/locks.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index db7b6917d9c5..3a012df87fd8 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2621,6 +2621,7 @@ static void lock_get_status(struct seq_file *f, struct 
> file_lock *fl,
>   loff_t id, char *pfx)
>  {
>   struct inode *inode = NULL;
> + struct dentry *dentry;
>   unsigned int fl_pid;
>   struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
>  
> @@ -2633,8 +2634,10 @@ static void lock_get_status(struct seq_file *f, struct 
> file_lock *fl,
>   if (fl_pid == 0)
>   return;
>  
> - if (fl->fl_file != NULL)
> + if (fl->fl_file != NULL) {
>   inode = locks_inode(fl->fl_file);
> + dentry = file_dentry(fl->fl_file);
> + }
>  
>   seq_printf(f, "%lld:%s ", id, pfx);
>   if (IS_POSIX(fl)) {
> @@ -2681,10 +2684,13 @@ static void lock_get_status(struct seq_file *f, 
> struct file_lock *fl,
>  : (fl->fl_type == F_WRLCK) ? "WRITE" : "READ ");
>   }
>   if (inode) {
> + __u64 ino;
> + dev_t dev;
> +
> + vfs_map_unique_ino_dev(dentry, , );

This code is under a spinlock (blocked_locks_lock or ctx->flc_lock). I
don't think it'll be ok to call ->getattr while holding a spinlock.

>   /* userspace relies on this representation of dev_t */
>   seq_printf(f, "%d %02x:%02x:%ld ", fl_pid,
> - MAJOR(inode->i_sb->s_dev),
> - MINOR(inode->i_sb->s_dev), inode->i_ino);
> + MAJOR(dev), MINOR(dev), inode->i_ino);
>   } else {
>   seq_printf(f, "%d :0 ", fl_pid);
>   }

-- 
Jeff Layton 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix data lose with snapshot when nospace

2018-08-01 Thread robbieko

Filipe Manana 於 2018-07-31 19:33 寫到:
On Tue, Jul 31, 2018 at 11:17 AM, robbieko  
wrote:

Filipe Manana 於 2018-07-30 20:34 寫到:


On Mon, Jul 30, 2018 at 12:28 PM, Filipe Manana 
wrote:


On Mon, Jul 30, 2018 at 12:08 PM, Filipe Manana 
wrote:


On Mon, Jul 30, 2018 at 11:21 AM, robbieko 
wrote:


From: Robbie Ko 

Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
modified the nocow writeback mechanism, if you create a snapshot,
it will always switch to cow writeback.

This will cause data loss when there is no space, because
when the space is full, the write will not reserve any space, only
check if it can be nocow write.



This is a bit vague.
You need to mention where space reservation does not happen (at the
time of the write syscall) and why,
and that the snapshot happens before flushing IO (running dealloc).
Then when running dealloc we fallback
to COW and fail.

You also need to tell that although the write syscall did not 
return

an error, the writeback will
fail but a subsequent fsync on the file will return an error 
(ENOSPC)

because the writeback set the error
on the inode's mapping, so it's not completely a silent data loss, 
as

for buffered writes there's no guarantee
that if write syscall returns 0 the data will be persisted
successfully (that can only be guaranteed if a subsequent
fsync call returns 0).



So fix this by first flush the nocow data, and then switch to the
cow write.



I'm also not seeing how what you've done is better then we have now
using the root->will_be_snapshotted atomic,
which is essentially used the same way as the new atomic you are
adding, and forces the writeback code no nocow
writes as well.



So what you have done can be made much more simple by flushing
delalloc before incrementing root->will_be_snapshotted instead of
after incrementing it:

https://friendpaste.com/2LY9eLAR9q0RoOtRK7VYmX


There is no way to solve this problem in this modification.


It minimizes it. It only gives better guarantees that nocow buffered
writes that happened before calling the snapshot ioctl will not fall
back to cow,
not for the ones that happen while the call to the ioctl is happening.



When writing and create snapshot at the same time, the write will not
reserve space,
and will not return to ENOSPC, because will_be_snapshotted is still 0.
So when writeback flush data, there will still be problems with 
ENOSPC.


Which is precisely what I proposed does without adding a new atomic
and more changes.
It flushes delalloc before incrementing root->will_be_snapshotted, so
that previous buffered nocow writes will not fallback to cow mode (and
require data space allocation).

It only leaves a very tiny and very unlikely to hit (but not
impossible) time window where nocow writes will fallback
to cow mode - after calling start_delalloc_inodes() and before
incrementing root->will_be_snapshotted a new buffered write can comes
in and gets immediately flushed
because someone called fsync() on the file or the VM decided to
trigger writeback (due to memory pressure or some other reason).



It is very easy to reproduce. Not a tiny time.
Because the time of start_delalloc_inodes will be very long.
When the dirty page is reduced, the new write can continue
to be written, and there is no reserve space.

And these dirty pages are not necessarily flushed by
start_delalloc_inodes because start_delalloc_inodes is
checked from the front to the back, so when the new
write is written to the previous position, it will not flush again.

So when start_delalloc_inodes is executed, will_be_snapshotted is added,
and all remaining dirty pages will be forced to turn to cow, all data is 
loss.

e.g. 16G RAM, dirty ratio 20%,  16*0.2 = 3.2 GB data loss.

Writing the inode that has been flushed will have the same problem.
The inode that has been flushed will not be flushed for the second time.

So you have better suggestions ?



The behavior I changed was to increase will_be_snapshotted first,
so the following write must have a reserve space,
otherwise it must be returned to ENOSPC.
And then go to flush data and flush the diry page with nocow,
When all the dirty pages are written back, then switch to cow mode.


And why did you wrote such a vague changelog? It misses all those
important and subtle details of the change.





Just checked the code and failure to allocate space during writeback
after falling back to COW mode does indeed set
AS_ENOSPC on the inode's mapping, which makes fsync return ENOSPC
(through file_check_and_advance_wb_err()
and filemap_check_wb_err()).

Since fsync reports the error, I'm unsure to call it data loss but
rather an optimization to avoid ENOSPC for nocow writes when running
low on space.



If you do not use fsync, you will not find the data loss.


That's one of the reasons why fsync exists.


I think that as long as the write returns successfully,
the data must be written back to the disk at the end,
otherwise it will be considered as data loss.


No 

Re: [PATCH 1/1] btrfs: Handle owner mismatch gracefully when walking up tree

2018-08-01 Thread Nikolay Borisov



On  1.08.2018 11:08, Qu Wenruo wrote:
> [BUG]
> When mounting certain crafted image, btrfs will trigger kernel BUG_ON()
> when try to recover balance:
> --
> [ cut here ]
> kernel BUG at fs/btrfs/extent-tree.c:8956!
> invalid opcode:  [#1] PREEMPT SMP NOPTI
> CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
> RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
> RSP: 0018:b53540c9b890 EFLAGS: 00010202
> Call Trace:
>  walk_up_tree+0x172/0x1f0 [btrfs]
>  btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
>  merge_reloc_roots+0xe1/0x1d0 [btrfs]
>  btrfs_recover_relocation+0x3ea/0x420 [btrfs]
>  open_ctree+0x1af3/0x1dd0 [btrfs]
>  btrfs_mount_root+0x66b/0x740 [btrfs]
>  mount_fs+0x3b/0x16a
>  vfs_kern_mount.part.9+0x54/0x140
>  btrfs_mount+0x16d/0x890 [btrfs]
>  mount_fs+0x3b/0x16a
>  vfs_kern_mount.part.9+0x54/0x140
>  do_mount+0x1fd/0xda0
>  ksys_mount+0xba/0xd0
>  __x64_sys_mount+0x21/0x30
>  do_syscall_64+0x60/0x210
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> ---[ end trace d4344e4deee03435 ]---
> --
> 
> [CAUSE]
> Another extent tree corruption.
> 
> In this particular case, tree reloc root's owner is
> DATA_RELOC_TREE (should be TREE_RELOC_TREE), thus its backref is
> corrupted and we failed the owner check in walk_up_tree().
> 
> [FIX]
> It's pretty hard to take care of every extent tree corruption, but at
> least we can remove such BUG_ON() and exit more gracefully.
> 
> And since in this particular image, DATA_RELOC_TREE and TREE_RELOC_TREE
> shares the same root (which is obviously invalid), we needs to make
> __del_reloc_root() more robust to detect such invalid share to avoid
> possible NULL dereference as root->node can be NULL in this case.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411
> Reported-by: Xu Wen 
> Signed-off-by: Qu Wenruo 
> ---
> As always, the patch is also pushed to my github repo, along with other
> fuzzed images related fixes:
> https://github.com/adam900710/linux/tree/tree_checker_enhance
> (BTW, is it correct to indicate a branch like above?)
> ---
>  fs/btrfs/extent-tree.c | 27 +++
>  fs/btrfs/relocation.c  |  2 +-
>  2 files changed, 20 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index da615ebc072e..5f4ca61348b5 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -8949,17 +8949,26 @@ static noinline int walk_up_proc(struct 
> btrfs_trans_handle *trans,
>   }
>  
>   if (eb == root->node) {
> - if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF)
> + if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
>   parent = eb->start;
> - else
> - BUG_ON(root->root_key.objectid !=
> -btrfs_header_owner(eb));
> + } else if (root->root_key.objectid != btrfs_header_owner(eb)) {
> + btrfs_err_rl(fs_info,
> + "unexpected tree owner, have %llu expect %llu",
> +  btrfs_header_owner(eb),
> +  root->root_key.objectid);
> + return -EINVAL;

EINVAL or ECLEANUP?

> + }
>   } else {
> - if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF)
> + if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
>   parent = path->nodes[level + 1]->start;
> - else
> - BUG_ON(root->root_key.objectid !=
> -btrfs_header_owner(path->nodes[level + 1]));
> + } else if (root->root_key.objectid !=
> +btrfs_header_owner(path->nodes[level + 1])) {
> + btrfs_err_rl(fs_info,
> + "unexpected tree owner, have %llu expect %llu",
> +  btrfs_header_owner(eb),
> +  root->root_key.objectid);
> + return -EINVAL;
ditto
> + }
>   }
>  
>   btrfs_free_tree_block(trans, root, eb, parent, wc->refs[level] == 1);
> @@ -9020,6 +9029,8 @@ static noinline int walk_up_tree(struct 
> btrfs_trans_handle *trans,
>   ret = walk_up_proc(trans, root, path, wc);
>   if (ret > 0)
>   return 0;
> + if (ret < 0)
> + return ret;
>  
>   if (path->locks[level]) {
>   btrfs_tree_unlock_rw(path->nodes[level],
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index a2fc0bd83a40..c64051d33d05 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -1321,7 +1321,7 @@ static void __del_reloc_root(struct btrfs_root *root)
>   struct mapping_node *node = NULL;
>   struct reloc_control *rc = fs_info->reloc_ctl;
>  
> - if (rc) {
> + if (rc && root->node) 

Re: BTRFS and databases

2018-08-01 Thread Mike Fleetwood
On 1 August 2018 at 04:45, MegaBrutal  wrote:

> But there is still one question that I can't get over: if you store a
> database (e.g. MySQL), would you prefer having a BTRFS volume mounted
> with nodatacow, or would you just simply use ext4?
>
> I know that with nodatacow, I take away most of the benefits of BTRFS
> (those are actually hurting database performance – the exact CoW
> nature that is elsewhere a blessing, with databases it's a drawback).
> But are there any advantages of still sticking to BTRFS for a database
> albeit CoW is disabled, or should I just return to the old and
> reliable ext4 for those applications?

Also note that no data CoW implies no data checksums too.
https://btrfs.wiki.kernel.org/index.php/FAQ#Can_I_have_nodatacow_.28or_chattr_.2BC.29_but_still_have_checksumming.3F

Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS and databases

2018-08-01 Thread Hugo Mills
On Wed, Aug 01, 2018 at 05:45:15AM +0200, MegaBrutal wrote:
> I know it's a decade-old question, but I'd like to hear your thoughts
> of today. By now, I became a heavy BTRFS user. Almost everywhere I use
> BTRFS, except in situations when it is obvious there is no benefit
> (e.g. /var/log, /boot). At home, all my desktop, laptop and server
> computers are mainly running on BTRFS with only a few file systems on
> ext4. I even installed BTRFS in corporate productive systems (in those
> cases, the systems were mainly on ext4; but there were some specific
> file systems those exploited BTRFS features).
> 
> But there is still one question that I can't get over: if you store a
> database (e.g. MySQL), would you prefer having a BTRFS volume mounted
> with nodatacow, or would you just simply use ext4?

   Personally, I'd start with btrfs with autodefrag. It has some
degree of I/O overhead, but if the database isn't performance-critical
and already near the limits of the hardware, it's unlikely to make
much difference. Autodefrag should keep the fragmentation down to a
minimum.

   Hugo.

> I know that with nodatacow, I take away most of the benefits of BTRFS
> (those are actually hurting database performance – the exact CoW
> nature that is elsewhere a blessing, with databases it's a drawback).
> But are there any advantages of still sticking to BTRFS for a database
> albeit CoW is disabled, or should I just return to the old and
> reliable ext4 for those applications?
> 
> 
> Kind regards,
> MegaBrutal

-- 
Hugo Mills | In theory, theory and practice are the same. In
hugo@... carfax.org.uk | practice, they're different.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature


Re: BTRFS and databases

2018-08-01 Thread Duncan
MegaBrutal posted on Wed, 01 Aug 2018 05:45:15 +0200 as excerpted:

> But there is still one question that I can't get over: if you store a
> database (e.g. MySQL), would you prefer having a BTRFS volume mounted
> with nodatacow, or would you just simply use ext4?
> 
> I know that with nodatacow, I take away most of the benefits of BTRFS
> (those are actually hurting database performance – the exact CoW nature
> that is elsewhere a blessing, with databases it's a drawback). But are
> there any advantages of still sticking to BTRFS for a database albeit
> CoW is disabled, or should I just return to the old and reliable ext4
> for those applications?

Good question, on which I might expect some honest disagreement on the 
answer.

Personally, I tend to hate nocow with a passion, and would thus recommend 
putting databases and similar write-pattern (VM images...) files on their 
own dedicated non-btrfs (ext4, etc) if at all reasonable.

But that comes from a general split partition-favoring viewpoint, where 
doing another partition/lvm-volume and putting a different filesystem on 
it is no big deal, as it's just one more partition/volume to manage of 
(likely) several.

Some distros/companies/installations have policies strongly favoring 
btrfs for its "storage pool" features, trying to keep things simple and 
flexible by using just the one solution and one big btrfs and throwing 
everything onto it, often using btrfs subvolumes where others would use 
separate partitions/volumes with independent filesystems.  For these 
folks, the flexibility of being able to throw it all on one filesystem 
with subvolumes overrides the down sides of having to deal with nocow and 
its conditions, rules and additional risk.

And a big part of that flexibility, along with being a feature in its own 
right, is btrfs built-in multi-device, without having to resort to an 
additional multi-device layer such as lvm or mdraid.


So if you're using btrfs for multi-device or other features that nocow 
doesn't affect, it's plausible that you'd prefer nocow on btrfs to 
/having/ to do partitioning/lvm/mdraid and setup that separate non-btrfs 
just for your database (or vm image) files.

But from your post you're perfectly fine with partitioning and the like 
already, and won't consider it a heavy imposition to deal with a separate 
non-btrfs, ext4 or whatever, and in that case, at least here, I'd 
strongly recommend you do just that, avoiding the nocow that I honestly 
see as a compromise best left to those that really need it because they 
aren't prepared to deal with the hassle of setting up the separate 
filesystem along with all that entails.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: csum failed on raid1 even after clean scrub?

2018-08-01 Thread Duncan
Sterling Windmill posted on Mon, 30 Jul 2018 21:06:54 -0400 as excerpted:

> Both drives are identical, Seagate 8TB external drives

Are those the "shingled" SMR drives, normally sold as archive drives and 
first commonly available in the 8TB size, and often bought for their 
generally better price-per-TB without fully realizing the implications.

There have been bugs regarding those drives in the past, and while I 
believe those bugs were fixed and AFAIK current status is no known SMR-
specific bugs, they really are /not/ particularly suited to btrfs usage 
even for archiving, and definitely not to general usage (that is, pretty 
much anything but the straight-up archiving use-case they are sold for) 
use-cases.

Of course USB connections are notorious for being unreliable in terms of 
btrfs usage as well, and I'd really hate to think what a combination of 
SMR on USB might wreak.

If they're not SMR then carry-on! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html