Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Miguel Ojeda
On Sat, Mar 10, 2018 at 7:10 AM, Miguel Ojeda
 wrote:
> On Sat, Mar 10, 2018 at 4:11 AM, Randy Dunlap  wrote:
>> On 03/09/2018 04:07 PM, Andrew Morton wrote:
>>> On Fri, 9 Mar 2018 12:05:36 -0800 Kees Cook  wrote:
>>>
 When max() is used in stack array size calculations from literal values
 (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
 thinks this is a dynamic calculation due to the single-eval logic, which
 is not needed in the literal case. This change removes several accidental
 stack VLAs from an x86 allmodconfig build:

 $ diff -u before.txt after.txt | grep ^-
 -drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids 
 variable length array ‘ids’ [-Wvla]
 -fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length 
 array ‘namebuf’ [-Wvla]
 -lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array 
 ‘sym’ [-Wvla]
 -net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array 
 ‘buff’ [-Wvla]
 -net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array 
 ‘buff’ [-Wvla]
 -net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array 
 ‘buff64’ [-Wvla]

 Based on an earlier patch from Josh Poimboeuf.
>>>
>>> v1, v2 and v3 of this patch all fail with gcc-4.4.4:
>>>
>>> ./include/linux/jiffies.h: In function 'jiffies_delta_to_clock_t':
>>> ./include/linux/jiffies.h:444: error: first argument to 
>>> '__builtin_choose_expr' not a constant
>>
>>
>> I'm seeing that problem with
>>> gcc --version
>> gcc (SUSE Linux) 4.8.5
>
> Same here, 4.8.5 fails. gcc 5.4.1 seems to work. I compiled a minimal
> 5.1.0 and it seems to work as well.
>

Just compiled 4.9.0 and it seems to work -- so that would be the
minimum required.

Sigh...

Some enterprise distros are either already shipping gcc >= 5 or will
probably be shipping it soon (e.g. RHEL 8), so how much does it hurt
to ask for a newer gcc? Are there many users/companies out there using
enterprise distributions' gcc to compile and run the very latest
kernels?

Miguel
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Miguel Ojeda
On Sat, Mar 10, 2018 at 4:11 AM, Randy Dunlap  wrote:
> On 03/09/2018 04:07 PM, Andrew Morton wrote:
>> On Fri, 9 Mar 2018 12:05:36 -0800 Kees Cook  wrote:
>>
>>> When max() is used in stack array size calculations from literal values
>>> (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
>>> thinks this is a dynamic calculation due to the single-eval logic, which
>>> is not needed in the literal case. This change removes several accidental
>>> stack VLAs from an x86 allmodconfig build:
>>>
>>> $ diff -u before.txt after.txt | grep ^-
>>> -drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids 
>>> variable length array ‘ids’ [-Wvla]
>>> -fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length 
>>> array ‘namebuf’ [-Wvla]
>>> -lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ 
>>> [-Wvla]
>>> -net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array 
>>> ‘buff’ [-Wvla]
>>> -net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array 
>>> ‘buff’ [-Wvla]
>>> -net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array 
>>> ‘buff64’ [-Wvla]
>>>
>>> Based on an earlier patch from Josh Poimboeuf.
>>
>> v1, v2 and v3 of this patch all fail with gcc-4.4.4:
>>
>> ./include/linux/jiffies.h: In function 'jiffies_delta_to_clock_t':
>> ./include/linux/jiffies.h:444: error: first argument to 
>> '__builtin_choose_expr' not a constant
>
>
> I'm seeing that problem with
>> gcc --version
> gcc (SUSE Linux) 4.8.5

Same here, 4.8.5 fails. gcc 5.4.1 seems to work. I compiled a minimal
5.1.0 and it seems to work as well.

Cheers,
Miguel
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Change of Ownership of the filesystem content when cloning a volume

2018-03-09 Thread Andrei Borzenkov
10.03.2018 02:13, Saravanan Shanmugham (sarvi) пишет:
> 
> Netapp’s storage system, has the concept of snapshot/clones.
> And when I create a clone from a snapshot, I can give/change ownership of 
> entire tree in the volume to a different userid.

You are probably mistaken. NetApp FlexClone (which you probably mean)
does not have any ways to change volume content. Of course you can now
mount this clone and do whatever you like from host, but that is
completely unrelated to NetApp itself and can just as well be done using
btrfs subvolume.

> 
> Is something like that possible in BTRFS?
> 
> We are looking to use CopyOnWrite to snapshot nightly build workspace and 
> clone as developer workspaces to avoid building from scratch for developers,
> And move directly for incremental builds.
> For this we would like the clone workspace/volume to be instantly owned by 
> the developer cloning the workspace.
> 
> Thanks,
> Sarvi
> -
> Occam's Razor Rules
> 
> 
> N�r��y���b�X��ǧv�^�)޺{.n�+{�n�߲)���w*jg����ݢj/���z�ޖ��2�ޙ���&�)ߡ�a�����G���h��j:+v���w�٥
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] btrfs-progs: Split original mode check to its own

2018-03-09 Thread Qu Wenruo


On 2018年03月10日 00:56, David Sterba wrote:
> On Tue, Feb 06, 2018 at 03:02:20PM +0800, Qu Wenruo wrote:
>> This time, there are 2 patches too large to reach mail list, so please
>> fetch the whole patchset from github as usual:
>> https://github.com/adam900710/btrfs-progs/tree/split_check_part2
> 
> I missed this patch series, sorry. The changes look good, there
> are unfortunatelly merge conflicts due to the asan/ubsan fixes that
> should be trivial to resolve. Can you please rebase the branch and let
> mek now? Thanks.

No problem, would send the new patchset and pull request to info you.

Thanks,
Qu

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 3/3] tracing: Rewrite filter logic to be simpler and faster

2018-03-09 Thread Steven Rostedt
On Fri, 9 Mar 2018 22:15:23 -0500
Steven Rostedt  wrote:

> Sorry for the spam.

A little more spam ;-)

I know what happened, as I'm able to recreate it. For those that use
claws-mail, be careful.

I clicked on the email I wanted to reply to. Then I must have hit the
down arrow key, as it moved the selected email to the next email. I hit
"Reply All", and it showed the email I clicked on (as well as the
subject), but the Cc list belonged to the email that was selected by the
down arrow key.

That's a nasty subtle bug. I think I'll go report this to the claws
folks.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] tracing: Rewrite filter logic to be simpler and faster

2018-03-09 Thread Steven Rostedt

I don't know what the hell happened, but claws mail just inserted a
ton of people into the Cc (I didn't add you). I noticed it just after I
hit send. The added Cc looks like it came from the email right after the
email I was replying to "Subject: Re: [PATCH v3] kernel.h: Skip
single-eval logic on literals in min()/max()".

Sorry for the spam.

-- Steve


On Fri, 9 Mar 2018 22:10:19 -0500
Steven Rostedt  wrote:

> On Fri, 09 Mar 2018 21:34:45 -0500
> Steven Rostedt  wrote:
> 
> >  2 files changed, 1050 insertions(+), 1273 deletions(-)  
> 
> BTW, it's a bit bigger impact than 223 deletions. As I added a lot of
> comments to explain the logic better. Removing comments and white space
> from the modifications we have:
> 
>  649 insertions(+), 1035 deletions(-)
> 
> That's 386 lines of code removed.
> 
> -- Steve

'
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Randy Dunlap
On 03/09/2018 04:07 PM, Andrew Morton wrote:
> On Fri, 9 Mar 2018 12:05:36 -0800 Kees Cook  wrote:
> 
>> When max() is used in stack array size calculations from literal values
>> (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
>> thinks this is a dynamic calculation due to the single-eval logic, which
>> is not needed in the literal case. This change removes several accidental
>> stack VLAs from an x86 allmodconfig build:
>>
>> $ diff -u before.txt after.txt | grep ^-
>> -drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids 
>> variable length array ‘ids’ [-Wvla]
>> -fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length 
>> array ‘namebuf’ [-Wvla]
>> -lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ 
>> [-Wvla]
>> -net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array 
>> ‘buff’ [-Wvla]
>> -net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array 
>> ‘buff’ [-Wvla]
>> -net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array 
>> ‘buff64’ [-Wvla]
>>
>> Based on an earlier patch from Josh Poimboeuf.
> 
> v1, v2 and v3 of this patch all fail with gcc-4.4.4:
> 
> ./include/linux/jiffies.h: In function 'jiffies_delta_to_clock_t':
> ./include/linux/jiffies.h:444: error: first argument to 
> '__builtin_choose_expr' not a constant


I'm seeing that problem with
> gcc --version
gcc (SUSE Linux) 4.8.5

in mmotm.

> That's with
> 
> #define __max(t1, t2, x, y)   \
>   __builtin_choose_expr(__builtin_constant_p(x) &&\
> __builtin_constant_p(y) &&\
> __builtin_types_compatible_p(t1, t2), \
> (t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),\
> __single_eval_max(t1, t2, \
>   __UNIQUE_ID(max1_), \
>   __UNIQUE_ID(max2_), \
>   x, y))
> /**
>  * max - return maximum of two values of the same or compatible types
>  * @x: first value
>  * @y: second value
>  */
> #define max(x, y) __max(typeof(x), typeof(y), x, y)
> 
> 
> A brief poke failed to reveal a workaround - gcc-4.4.4 doesn't appear
> to know that __builtin_constant_p(x) is a constant.  Or something.
> 
> Sigh.  Wasn't there some talk about modernizing our toolchain
> requirements?


-- 
~Randy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] tracing: Rewrite filter logic to be simpler and faster

2018-03-09 Thread Steven Rostedt
On Fri, 09 Mar 2018 21:34:45 -0500
Steven Rostedt  wrote:

>  2 files changed, 1050 insertions(+), 1273 deletions(-)

BTW, it's a bit bigger impact than 223 deletions. As I added a lot of
comments to explain the logic better. Removing comments and white space
from the modifications we have:

 649 insertions(+), 1035 deletions(-)

That's 386 lines of code removed.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


zerofree btrfs support?

2018-03-09 Thread Christoph Anton Mitterer
Hi.

Just wondered... was it ever planned (or is there some equivalent) to
get support for btrfs in zerofree?

Thanks,
Chris.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 5:31 PM, Kees Cook  wrote:
>
> WTF, gmail just blasted HTML into my explicitly plain-text email?! 
> Apologies...

There's more now in your email, I think maybe it's triggered by your
signature file and some gmail web interface bug. Or it just tries to
force quote using html.

There's some html email disease inside google, where some parts of the
company seems to think that html email is _good_.

The official gmail Android app is a big pain, int hat it doesn't
*have* a plain-text mode, even though it knows how to format the
plain-text part of the email.

You might want to be on the lookout for some bad drugs at the office.
Because the world would thank you if you took them away from the gmail
app people.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Kees Cook
On Fri, Mar 9, 2018 at 5:30 PM, Kees Cook  wrote:
> --
> Kees Cook
> Pixel SecurityOn
> [...]

WTF, gmail just blasted HTML into my explicitly plain-text email?! Apologies...

-- 
Kees Cook
Pixel SecurityOn
Fri, Mar 9, 2018 at 5:30 PM, Kees Cook mailto:keesc...@chromium.org;
target="_blank">keesc...@chromium.org
wrote:On
Fri, Mar 9, 2018 at 4:38 PM, Linus Torvalds
mailto:torva...@linux-foundation.org;>torva...@linux-foundation.org
wrote:
 On Fri, Mar 9, 2018 at 4:32 PM, Andrew Morton mailto:a...@linux-foundation.org;>a...@linux-foundation.org
wrote:

 I wonder which gcc versions actually accept Kees's addition.

Ah, my old nemesis, gcc 4.4.4. *sob*

 Note that we already do have this pattern, as seen by:

 git grep -2 __builtin_choose_expr | grep -2
__builtin_constant_p

 which show drivers/pinctrl/intel/pinctrl-intel.h, so the pattern
 already exists current kernels and _works_ - it apparently just
 doesn't work in slightly more complicated cases.

Fun. Yeah, so all the PIN_GROUP() callers have either a literal or an
array name for that argument, so the argument to
__builtin_constant_p() isn't complex.

git grep '\bPIN_GROUP\b' | egrep -v '(1|2|3|true|false)\)'

 It's one reason why I wondered if simplifying the expression to have
 just that single __builtin_constant_p() might not end up working..

Yeah, it seems like it doesn't bail out as "false" for complex
expressions given to __builtin_constant_p().

If no magic solution, then which of these?

- go back to my original "multi-eval max only for constants" macro (meh)
- add gcc version checks around this and similarly for -Wvla in the
future (eww)
- raise gcc version (yikes)

-Kees

--
Kees Cook
Pixel Securitydiv class="gmail_extra"brdiv
class="gmail_quote"On
Fri, Mar 9, 2018 at 4:38 PM, Linus Torvalds span
dir="ltr"lt;a
href="mailto:mailto:torva...@linux-foundation.org;>torvalds@linux-foundation.org"
target="_blank"mailto:torva...@linux-foundation.org;>torvalds@linux-foundation.org/agt;/span
wrote:brblockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex"span
class=""On
Fri, Mar 9, 2018 at 4:32 PM, Andrew Morton lt;a
href="mailto:mailto:a...@linux-foundation.org;>akpm@linux-foundation.org"mailto:a...@linux-foundation.org;>akpm@linux-foundation.org/agt;
wrote:br
gt;br
gt; I wonder which gcc versions actually accept Kees's
addition.br
br
/spanNote that we already do have this pattern, as seen
by:br
br
nbsp; git grep -2nbsp; __builtin_choose_expr | grep -2
__builtin_constant_pbr
br
which show drivers/pinctrl/intel/pinctrl-wbrintel.h, so
the patternbr
already exists current kernels and _works_ - it apparently justbr
doesn't work in slightly more complicated cases.br
br
It's one reason why I wondered if simplifying the expression to
havebr
just that single __builtin_constant_p() might not end up working..br
span
class="HOEnZb"font color="#88"br
nbsp; nbsp; nbsp; nbsp; nbsp; nbsp;
nbsp; Linusbr
/font/span/blockquote/divbrbr
clear="all"divbr/div--
brdiv class="gmail_signature"
data-smartmail="gmail_signature"Kees
CookbrPixel Security/div
/div
--
Kees
CookPixel Security

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Kees Cook
On Fri, Mar 9, 2018 at 4:38 PM, Linus Torvalds
 wrote:
> On Fri, Mar 9, 2018 at 4:32 PM, Andrew Morton  
> wrote:
>>
>> I wonder which gcc versions actually accept Kees's addition.

Ah, my old nemesis, gcc 4.4.4. *sob*

> Note that we already do have this pattern, as seen by:
>
>   git grep -2  __builtin_choose_expr | grep -2 __builtin_constant_p
>
> which show drivers/pinctrl/intel/pinctrl-intel.h, so the pattern
> already exists current kernels and _works_ - it apparently just
> doesn't work in slightly more complicated cases.

Fun. Yeah, so all the PIN_GROUP() callers have either a literal or an
array name for that argument, so the argument to
__builtin_constant_p() isn't complex.

git grep '\bPIN_GROUP\b' | egrep -v '(1|2|3|true|false)\)'

> It's one reason why I wondered if simplifying the expression to have
> just that single __builtin_constant_p() might not end up working..

Yeah, it seems like it doesn't bail out as "false" for complex
expressions given to __builtin_constant_p().

If no magic solution, then which of these?

- go back to my original "multi-eval max only for constants" macro (meh)
- add gcc version checks around this and similarly for -Wvla in the future (eww)
- raise gcc version (yikes)

-Kees

-- 
Kees Cook
Pixel SecurityOn
Fri, Mar 9, 2018 at 4:38 PM, Linus Torvalds mailto:torva...@linux-foundation.org;
target="_blank">torva...@linux-foundation.org
wrote:On
Fri, Mar 9, 2018 at 4:32 PM, Andrew Morton mailto:a...@linux-foundation.org;>a...@linux-foundation.org
wrote:

 I wonder which gcc versions actually accept Kees's addition.

Note that we already do have this pattern, as seen by:

 git grep -2 __builtin_choose_expr | grep -2
__builtin_constant_p

which show drivers/pinctrl/intel/pinctrl-intel.h, so the pattern
already exists current kernels and _works_ - it apparently just
doesn't work in slightly more complicated cases.

It's one reason why I wondered if simplifying the expression to have
just that single __builtin_constant_p() might not end up working..

   Linus
--
Kees
CookPixel Security

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4.4 12/36] btrfs: Dont clear SGID when inheriting ACLs

2018-03-09 Thread Greg Kroah-Hartman
4.4-stable review patch.  If anyone has any objections, please let me know.

--

From: Jan Kara 

commit b7f8a09f8097db776b8d160862540e4fc1f51296 upstream.

When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.

Fix the problem by moving posix_acl_update_mode() out of
__btrfs_set_acl() into btrfs_set_acl(). That way the function will not be
called when inheriting ACLs which is what we want as it prevents SGID
bit clearing and the mode has been properly set by posix_acl_create()
anyway.

Fixes: 073931017b49d9458aa351605b43a7e34598caef
CC: sta...@vger.kernel.org
CC: linux-btrfs@vger.kernel.org
CC: David Sterba 
Signed-off-by: Jan Kara 
Signed-off-by: David Sterba 
Signed-off-by: Nikolay Borisov 
Signed-off-by: Greg Kroah-Hartman 
---
 fs/btrfs/acl.c |   13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

--- a/fs/btrfs/acl.c
+++ b/fs/btrfs/acl.c
@@ -82,12 +82,6 @@ static int __btrfs_set_acl(struct btrfs_
switch (type) {
case ACL_TYPE_ACCESS:
name = POSIX_ACL_XATTR_ACCESS;
-   if (acl) {
-   ret = posix_acl_update_mode(inode, >i_mode, 
);
-   if (ret)
-   return ret;
-   }
-   ret = 0;
break;
case ACL_TYPE_DEFAULT:
if (!S_ISDIR(inode->i_mode))
@@ -123,6 +117,13 @@ out:
 
 int btrfs_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 {
+   int ret;
+
+   if (type == ACL_TYPE_ACCESS && acl) {
+   ret = posix_acl_update_mode(inode, >i_mode, );
+   if (ret)
+   return ret;
+   }
return __btrfs_set_acl(NULL, inode, acl, type);
 }
 


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 4:32 PM, Andrew Morton  wrote:
>
> I wonder which gcc versions actually accept Kees's addition.

Note that we already do have this pattern, as seen by:

  git grep -2  __builtin_choose_expr | grep -2 __builtin_constant_p

which show drivers/pinctrl/intel/pinctrl-intel.h, so the pattern
already exists current kernels and _works_ - it apparently just
doesn't work in slightly more complicated cases.

It's one reason why I wondered if simplifying the expression to have
just that single __builtin_constant_p() might not end up working..

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Andrew Morton
On Fri, 9 Mar 2018 16:28:51 -0800 Linus Torvalds 
 wrote:

> On Fri, Mar 9, 2018 at 4:07 PM, Andrew Morton  
> wrote:
> >
> > A brief poke failed to reveal a workaround - gcc-4.4.4 doesn't appear
> > to know that __builtin_constant_p(x) is a constant.  Or something.
> 
> LOL.
> 
> I suspect it might be that it wants to evaluate
> __builtin_choose_expr() at an earlier stage than it evaluates
> __builtin_constant_p(), so it's not that it doesn't know that
> __builtin_constant_p() is a constant, it just might not know it *yet*.
> 
> Maybe.
> 
> Side note, if it's not that, but just the "complex" expression that
> has the logical 'and' etc, maybe the code could just use
> 
>   __builtin_constant_p((x)+(y))
> 
> or something.

I'll do a bit more poking at it.

> But yeah:
> 
> > Sigh.  Wasn't there some talk about modernizing our toolchain
> > requirements?
> 
> Maybe it's just time to give up on 4.4.  We wanted 4.5 for "asm goto",
> and once we upgrade to 4.5 I think Arnd said that no distro actually
> ships it, so we might as well go to 4.6.
> 
> So maybe this is just the excuse to finally make that official, if
> there is no clever workaround any more.

I wonder which gcc versions actually accept Kees's addition.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 4:07 PM, Andrew Morton  wrote:
>
> A brief poke failed to reveal a workaround - gcc-4.4.4 doesn't appear
> to know that __builtin_constant_p(x) is a constant.  Or something.

LOL.

I suspect it might be that it wants to evaluate
__builtin_choose_expr() at an earlier stage than it evaluates
__builtin_constant_p(), so it's not that it doesn't know that
__builtin_constant_p() is a constant, it just might not know it *yet*.

Maybe.

Side note, if it's not that, but just the "complex" expression that
has the logical 'and' etc, maybe the code could just use

  __builtin_constant_p((x)+(y))

or something.

But yeah:

> Sigh.  Wasn't there some talk about modernizing our toolchain
> requirements?

Maybe it's just time to give up on 4.4.  We wanted 4.5 for "asm goto",
and once we upgrade to 4.5 I think Arnd said that no distro actually
ships it, so we might as well go to 4.6.

So maybe this is just the excuse to finally make that official, if
there is no clever workaround any more.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: dump-tree: add degraded option

2018-03-09 Thread Anand Jain



On 03/10/2018 12:23 AM, David Sterba wrote:

On Thu, Feb 22, 2018 at 10:00:11PM +0800, Anand Jain wrote:

btrfs inspect dump-tree cli picks the disk with the largest generation
to read the root tree, even when all the devices were not provided in
the cli. But in 2 disks RAID1 you may need to know what's in the disks
individually, so this option -x | --degraded indicates to use only the
given disk to dump.

Signed-off-by: Anand Jain 
---
  cmds-inspect-dump-tree.c | 7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c
index df44bb635c9c..587b6081df0c 100644
--- a/cmds-inspect-dump-tree.c
+++ b/cmds-inspect-dump-tree.c
@@ -198,6 +198,7 @@ const char * const cmd_inspect_dump_tree_usage[] = {
"-u|--uuid  print only the uuid tree",
"-b|--block  print info from the specified block only",
"-t|--tree print only tree with the given id (string or 
number)",
+   "-x|--degraded  For RAID1, use the disk in the arg, do not scan for 
disks",


I find this confusing, degraded for kernel means 'mount with anything
you have' while here it means 'use exactly that disk'.


 Agreed.


Something that would reflect the "don't scan" would be IMHO better, but
otherwise the goal of the patch makes sense.


 Hope this is better.

  -x|--noscan


Thanks, Anand



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Andrew Morton
On Fri, 9 Mar 2018 12:05:36 -0800 Kees Cook  wrote:

> When max() is used in stack array size calculations from literal values
> (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
> thinks this is a dynamic calculation due to the single-eval logic, which
> is not needed in the literal case. This change removes several accidental
> stack VLAs from an x86 allmodconfig build:
> 
> $ diff -u before.txt after.txt | grep ^-
> -drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids 
> variable length array ‘ids’ [-Wvla]
> -fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length 
> array ‘namebuf’ [-Wvla]
> -lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ 
> [-Wvla]
> -net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array ‘buff’ 
> [-Wvla]
> -net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array ‘buff’ 
> [-Wvla]
> -net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array 
> ‘buff64’ [-Wvla]
> 
> Based on an earlier patch from Josh Poimboeuf.

v1, v2 and v3 of this patch all fail with gcc-4.4.4:

./include/linux/jiffies.h: In function 'jiffies_delta_to_clock_t':
./include/linux/jiffies.h:444: error: first argument to '__builtin_choose_expr' 
not a constant

That's with

#define __max(t1, t2, x, y) \
__builtin_choose_expr(__builtin_constant_p(x) &&\
  __builtin_constant_p(y) &&\
  __builtin_types_compatible_p(t1, t2), \
  (t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),\
  __single_eval_max(t1, t2, \
__UNIQUE_ID(max1_), \
__UNIQUE_ID(max2_), \
x, y))
/**
 * max - return maximum of two values of the same or compatible types
 * @x: first value
 * @y: second value
 */
#define max(x, y)   __max(typeof(x), typeof(y), x, y)


A brief poke failed to reveal a workaround - gcc-4.4.4 doesn't appear
to know that __builtin_constant_p(x) is a constant.  Or something.

Sigh.  Wasn't there some talk about modernizing our toolchain
requirements?

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


call trace on btrfs send/receive

2018-03-09 Thread Christoph Anton Mitterer
Hey.

The following still happens with 4.15 kernel/progs:

btrfs send -p oldsnap newsnap | btrfs receive /some/other/fs

Mar 10 00:48:10 heisenberg kernel: WARNING: CPU: 5 PID: 32197 at 
/build/linux-PFKtCE/linux-4.15.4/fs/btrfs/send.c:6487 
btrfs_ioctl_send+0x48f/0xfb0 [btrfs]
Mar 10 00:48:10 heisenberg kernel: Modules linked in: udp_diag tcp_diag 
inet_diag algif_skcipher af_alg uas vhost_net vhost tap xt_CHECKSUM 
iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 
nf_nat tun bridge stp llc ctr ccm fuse ebtable_filter ebtables devlink 
cpufreq_userspace cpufreq_powersave cpufreq_conservative ip6t_REJECT 
nf_reject_ipv6 xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter 
ip6_tables xt_policy ipt_REJECT nf_reject_ipv4 xt_comment nf_conntrack_ipv4 
nf_defrag_ipv4 xt_multiport xt_conntrack nf_conntrack binfmt_misc 
iptable_filter joydev snd_hda_codec_hdmi snd_hda_codec_realtek 
snd_hda_codec_generic arc4 intel_rapl x86_pkg_temp_thermal intel_powerclamp 
coretemp btusb btrtl btbcm btintel kvm_intel iwldvm bluetooth kvm irqbypass 
rtsx_pci_sdmmc rtsx_pci_ms memstick mmc_core
Mar 10 00:48:10 heisenberg kernel:  mac80211 iTCO_wdt crct10dif_pclmul 
iTCO_vendor_support uvcvideo crc32_pclmul videobuf2_vmalloc videobuf2_memops 
ghash_clmulni_intel videobuf2_v4l2 drbg intel_cstate videobuf2_core iwlwifi 
ansi_cprng intel_uncore videodev ecdh_generic crc16 media intel_rapl_perf sg 
psmouse i915 i2c_i801 snd_hda_intel pcspkr cfg80211 rtsx_pci snd_hda_codec 
rfkill snd_hda_core snd_hwdep drm_kms_helper fujitsu_laptop snd_pcm 
sparse_keymap drm video snd_timer ac button snd battery mei_me lpc_ich 
soundcore i2c_algo_bit mei mfd_core shpchp loop parport_pc ppdev sunrpc lp 
parport ip_tables x_tables autofs4 dm_crypt dm_mod raid10 raid456 
async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c raid1 
raid0 multipath linear md_mod btrfs crc32c_generic xor zstd_decompress 
zstd_compress xxhash raid6_pq
Mar 10 00:48:10 heisenberg kernel:  uhci_hcd sd_mod usb_storage crc32c_intel 
ahci libahci aesni_intel libata ehci_pci aes_x86_64 evdev xhci_pci crypto_simd 
cryptd glue_helper xhci_hcd ehci_hcd serio_raw scsi_mod e1000e ptp usbcore 
pps_core usb_common
Mar 10 00:48:10 heisenberg kernel: CPU: 5 PID: 32197 Comm: btrfs Not tainted 
4.15.0-1-amd64 #1 Debian 4.15.4-1
Mar 10 00:48:10 heisenberg kernel: Hardware name: FUJITSU LIFEBOOK 
E782/FJNB253, BIOS Version 2.11 07/15/2014
Mar 10 00:48:10 heisenberg kernel: RIP: 0010:btrfs_ioctl_send+0x48f/0xfb0 
[btrfs]
Mar 10 00:48:10 heisenberg kernel: RSP: 0018:a4cc0a377c48 EFLAGS: 00010293
Mar 10 00:48:10 heisenberg kernel: RAX:  RBX: 958718b1140c 
RCX: 0001
Mar 10 00:48:10 heisenberg kernel: RDX: 0001 RSI: 0015 
RDI: 958718b1140c
Mar 10 00:48:10 heisenberg kernel: RBP: 9587617c1c00 R08: 4000 
R09: 0060
Mar 10 00:48:10 heisenberg kernel: R10: 0015 R11: 0246 
R12: 958718b11000
Mar 10 00:48:10 heisenberg kernel: R13: 9587b7cfdad0 R14: 95850d8d4000 
R15: 958718b11000
Mar 10 00:48:10 heisenberg kernel: FS:  7f5f0866a8c0() 
GS:95881e34() knlGS:
Mar 10 00:48:10 heisenberg kernel: CS:  0010 DS:  ES:  CR0: 
80050033
Mar 10 00:48:10 heisenberg kernel: CR2: 7f5f073a4e38 CR3: 0001e6b56004 
CR4: 001606e0
Mar 10 00:48:10 heisenberg kernel: Call Trace:
Mar 10 00:48:10 heisenberg kernel:  ? kmem_cache_alloc_trace+0x14b/0x1a0
Mar 10 00:48:10 heisenberg kernel:  ? 
insert_reserved_file_extent.constprop.69+0x2c1/0x2f0 [btrfs]
Mar 10 00:48:10 heisenberg kernel:  ? btrfs_opendir+0x3e/0x70 [btrfs]
Mar 10 00:48:10 heisenberg kernel:  ? _cond_resched+0x15/0x40
Mar 10 00:48:10 heisenberg kernel:  ? __kmalloc_track_caller+0x190/0x220
Mar 10 00:48:10 heisenberg kernel:  ? __check_object_size+0xaf/0x1b0
Mar 10 00:48:10 heisenberg kernel:  _btrfs_ioctl_send+0x80/0x110 [btrfs]
Mar 10 00:48:10 heisenberg kernel:  ? task_change_group_fair+0xb3/0x100
Mar 10 00:48:10 heisenberg kernel:  ? cpu_cgroup_fork+0x66/0x90
Mar 10 00:48:10 heisenberg kernel:  btrfs_ioctl+0xfab/0x2450 [btrfs]
Mar 10 00:48:10 heisenberg kernel:  ? enqueue_entity+0x106/0x6b0
Mar 10 00:48:10 heisenberg kernel:  ? enqueue_task_fair+0x67/0x7d0
Mar 10 00:48:10 heisenberg kernel:  ? do_vfs_ioctl+0xa4/0x630
Mar 10 00:48:10 heisenberg kernel:  do_vfs_ioctl+0xa4/0x630
Mar 10 00:48:10 heisenberg kernel:  ? _do_fork+0x14d/0x3f0
Mar 10 00:48:10 heisenberg kernel:  SyS_ioctl+0x74/0x80
Mar 10 00:48:10 heisenberg kernel:  do_syscall_64+0x6f/0x130
Mar 10 00:48:10 heisenberg kernel:  entry_SYSCALL_64_after_hwframe+0x21/0x86
Mar 10 00:48:10 heisenberg kernel: RIP: 0033:0x7f5f07493f07
Mar 10 00:48:10 heisenberg kernel: RSP: 002b:7fff8a4619d8 EFLAGS: 0246 
ORIG_RAX: 0010
Mar 10 00:48:10 heisenberg kernel: RAX: ffda RBX: 55b941872270 
RCX: 7f5f07493f07
Mar 10 00:48:10 heisenberg kernel: 

Change of Ownership of the filesystem content when cloning a volume

2018-03-09 Thread Saravanan Shanmugham (sarvi)

Netapp’s storage system, has the concept of snapshot/clones.
And when I create a clone from a snapshot, I can give/change ownership of 
entire tree in the volume to a different userid.

Is something like that possible in BTRFS?

We are looking to use CopyOnWrite to snapshot nightly build workspace and clone 
as developer workspaces to avoid building from scratch for developers,
And move directly for incremental builds.
For this we would like the clone workspace/volume to be instantly owned by the 
developer cloning the workspace.

Thanks,
Sarvi
-
Occam's Razor Rules




Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Kees Cook
On Fri, Mar 9, 2018 at 1:10 PM, Linus Torvalds
 wrote:
> On Fri, Mar 9, 2018 at 12:05 PM, Kees Cook  wrote:
>> When max() is used in stack array size calculations from literal values
>> (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
>> thinks this is a dynamic calculation due to the single-eval logic, which
>> is not needed in the literal case. This change removes several accidental
>> stack VLAs from an x86 allmodconfig build:
>
> Ok, looks good.
>
> I just have a couple of questions about applying it.
>
> In particular, if this will help people working on getting rid of
> VLA's in the short term, I can apply it directly.

AFAIK, all the VLAs effected by max() are fixed with this patch. i.e.
no other VLA work I'm aware of depends on this (famous last words).

> But if people who are looking at it (anybody else than Kees?)

FWIW, I've seen at least 6 other people helping out now with VLA clean
up patches. Whee. :)

> don't much care, then this
> might be a 4.17 thing or at least "random -mm queue"?

Andrew has this (v2) queued in -mm for 4.17. I didn't view VLA clean
up (or this macro change) as urgent by any means.

> #define __single_eval_max(max1, max2, x, y) ({  \
> typeof (x) max1 = (x);  \
> typeof (y) max2 = (y);  \
> max1 > max2 ? max1 : max2; })
>
> actually looks more natural to me than passing the two types in as
> arguments to the macro.

I (obviously) didn't design this macro originally, but my take on this
was that it's safer to keep the type check together with the
comparison so that it is never possible for someone to run off and use
__single_eval_max() directly and accidentally bypass the type check.
While it does look silly from the max_t()/min_t() perspective, it just
seems more defensive.

> (That form also is amenable to things like "__auto_type" etc simplifications).

Agreed, I think that's the biggest reason to lift the types. However,
since we're still not actually doing anything with the types (i.e.
this change doesn't weaken the type checking), I would think it would
be orthogonal. While it would be nice to resolve differing types
sanely, there doesn't appear to be a driving reason to make this
change. The example case of max(5, sizeof(thing)) doesn't currently
exist in the code and I don't see how making min()/max() _less_ strict
would be generically useful (in fact, it has proven useful to have
strict type checking).

> Side note: do we *really* need the unique variable names? That's what
> makes those things _really_ illegible. I thgink it's done just for a
> sparse warning that we should probably ignore. It's off by default
> anyway exactly because it doesn't work that well due to nested macro
> expansions like this.

That I'm not sure about. I'd be fine to remove them; I left them in
place because it seemed quite intentional. :)

> There is very real value to keeping our odd macros legible, I feel.
> Even when they are complicated by issues like this, it would be good
> to keep them as simple as possible.
>
> Comments?

I'm open to whatever! I'm just glad this gets to kill a handful of
"accidental" stack VLAs. :)

-Kees

-- 
Kees Cook
Pixel SecurityOn
Fri, Mar 9, 2018 at 1:10 PM, Linus Torvalds mailto:torva...@linux-foundation.org;
target="_blank">torva...@linux-foundation.org
wrote:On
Fri, Mar 9, 2018 at 12:05 PM, Kees Cook mailto:keesc...@chromium.org;>keesc...@chromium.org
wrote:
 When max() is used in stack array size calculations from literal values
 (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
 thinks this is a dynamic calculation due to the single-eval
logic, which
 is not needed in the literal case. This change removes several
accidental
 stack VLAs from an x86 allmodconfig build:

Ok, looks good.

I just have a couple of questions about applying it.

In particular, if this will help people working on getting rid of
VLA's in the short term, I can apply it directly. But if people who
are looking at it (anybody else than Kees?) don't much care, then this
might be a 4.17 thing or at least "random -mm queue"?

The other unrelated reaction I had to this was that "we're passing
those types down very deep, even though nobody _cares_ about them all
that much at that deep level".

Honestly, the only case that really cares is the very top level, and
the rest could just take the properly cast versions.

For example, "max_t/min_t" really don't care at all, since they - by
definition - just take the single specified type.

So I'm wondering if we should just drop the types from __max/__min
(and everything they call) entirely, and instead do

  #define __check_type(x,y)
((void)((typeof(x)*)1==(typeof(y)*)1))
  #define min(x,y) (__check_type(x,y),__min(x,y))
  #define max(x,y) (__check_type(x,y),__max(x,y))

  #define min_t(t,x,y) __min((t)(x),(t)(y))
  #define max_t(t,x,y) __max((t)(x),(t)(y))

and then __min/__max and friends 

Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 12:05 PM, Kees Cook  wrote:
> When max() is used in stack array size calculations from literal values
> (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
> thinks this is a dynamic calculation due to the single-eval logic, which
> is not needed in the literal case. This change removes several accidental
> stack VLAs from an x86 allmodconfig build:

Ok, looks good.

I just have a couple of questions about applying it.

In particular, if this will help people working on getting rid of
VLA's in the short term, I can apply it directly. But if people who
are looking at it (anybody else than Kees?) don't much care, then this
might be a 4.17 thing or at least "random -mm queue"?

The other unrelated reaction I had to this was that "we're passing
those types down very deep, even though nobody _cares_ about them all
that much at that deep level".

Honestly, the only case that really cares is the very top level, and
the rest could just take the properly cast versions.

For example, "max_t/min_t" really don't care at all, since they - by
definition - just take the single specified type.

So I'm wondering if we should just drop the types from __max/__min
(and everything they call) entirely, and instead do

#define __check_type(x,y) ((void)((typeof(x)*)1==(typeof(y)*)1))
#define min(x,y)   (__check_type(x,y),__min(x,y))
#define max(x,y)   (__check_type(x,y),__max(x,y))

#define min_t(t,x,y) __min((t)(x),(t)(y))
#define max_t(t,x,y) __max((t)(x),(t)(y))

and then __min/__max and friends are much simpler (and can just assume
that the type is already fine, and the casting has been done).

This is technically entirely independent of this VLA cleanup thing,
but the "passing the types around unnecessarily" just becomes more
obvious when there's now another level of macros, _and_ it's a more
complex expression too.

Yes, yes, the __single_eval_xyz() functions still end up wanting the
types for the declaration of the new single-evaluation variables, but
the 'typeof' pattern is the standard pattern, so

#define __single_eval_max(max1, max2, x, y) ({  \
typeof (x) max1 = (x);  \
typeof (y) max2 = (y);  \
max1 > max2 ? max1 : max2; })

actually looks more natural to me than passing the two types in as
arguments to the macro.

(That form also is amenable to things like "__auto_type" etc simplifications).

Side note: do we *really* need the unique variable names? That's what
makes those things _really_ illegible. I thgink it's done just for a
sparse warning that we should probably ignore. It's off by default
anyway exactly because it doesn't work that well due to nested macro
expansions like this.

There is very real value to keeping our odd macros legible, I feel.
Even when they are complicated by issues like this, it would be good
to keep them as simple as possible.

Comments?

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Kees Cook
When max() is used in stack array size calculations from literal values
(e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
thinks this is a dynamic calculation due to the single-eval logic, which
is not needed in the literal case. This change removes several accidental
stack VLAs from an x86 allmodconfig build:

$ diff -u before.txt after.txt | grep ^-
-drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids 
variable length array ‘ids’ [-Wvla]
-fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length array 
‘namebuf’ [-Wvla]
-lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ 
[-Wvla]
-net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array ‘buff’ 
[-Wvla]
-net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array ‘buff’ 
[-Wvla]
-net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array ‘buff64’ 
[-Wvla]

Based on an earlier patch from Josh Poimboeuf.

Signed-off-by: Kees Cook 
---
v3:
- drop __builtin_types_compatible_p() (Rasmus, Linus)
v2:
- fix copy/paste-o max1_/max2_ (ijc)
- clarify "compile-time" constant in comment (Rasmus)
- clean up formatting on min_t()/max_t()
---
 include/linux/kernel.h | 48 ++--
 1 file changed, 30 insertions(+), 18 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 3fd291503576..a0fca4deb3ab 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -787,37 +787,55 @@ static inline void ftrace_dump(enum ftrace_dump_mode 
oops_dump_mode) { }
  * strict type-checking.. See the
  * "unnecessary" pointer comparison.
  */
-#define __min(t1, t2, min1, min2, x, y) ({ \
+#define __single_eval_min(t1, t2, min1, min2, x, y) ({ \
t1 min1 = (x);  \
t2 min2 = (y);  \
(void) ( == );\
min1 < min2 ? min1 : min2; })
 
+/*
+ * In the case of compile-time constant values, there is no need to do
+ * the double-evaluation protection, so the raw comparison can be made.
+ * This allows min()/max() to be used in stack array allocations and
+ * avoid the compiler thinking it is a dynamic value leading to an
+ * accidental VLA.
+ */
+#define __min(t1, t2, x, y)\
+   __builtin_choose_expr(__builtin_constant_p(x) &&\
+ __builtin_constant_p(y),  \
+ (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\
+ __single_eval_min(t1, t2, \
+   __UNIQUE_ID(min1_), \
+   __UNIQUE_ID(min2_), \
+   x, y))
+
 /**
  * min - return minimum of two values of the same or compatible types
  * @x: first value
  * @y: second value
  */
-#define min(x, y)  \
-   __min(typeof(x), typeof(y), \
- __UNIQUE_ID(min1_), __UNIQUE_ID(min2_),   \
- x, y)
+#define min(x, y)  __min(typeof(x), typeof(y), x, y)
 
-#define __max(t1, t2, max1, max2, x, y) ({ \
+#define __single_eval_max(t1, t2, max1, max2, x, y) ({ \
t1 max1 = (x);  \
t2 max2 = (y);  \
(void) ( == );\
max1 > max2 ? max1 : max2; })
 
+#define __max(t1, t2, x, y)\
+   __builtin_choose_expr(__builtin_constant_p(x) &&\
+ __builtin_constant_p(y),  \
+ (t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),\
+ __single_eval_max(t1, t2, \
+   __UNIQUE_ID(max1_), \
+   __UNIQUE_ID(max2_), \
+   x, y))
 /**
  * max - return maximum of two values of the same or compatible types
  * @x: first value
  * @y: second value
  */
-#define max(x, y)  \
-   __max(typeof(x), typeof(y), \
- __UNIQUE_ID(max1_), __UNIQUE_ID(max2_),   \
- x, y)
+#define max(x, y)  __max(typeof(x), typeof(y), x, y)
 
 /**
  * min3 - return minimum of three values
@@ -869,10 +887,7 @@ static inline void ftrace_dump(enum ftrace_dump_mode 
oops_dump_mode) { }
  * @x: first value
  * @y: second value
  */
-#define min_t(type, x, y)  \
-   __min(type, type,   \
- __UNIQUE_ID(min1_), __UNIQUE_ID(min2_),   \
- x, y)
+#define min_t(type, x, y)   __min(type, type, x, y)
 
 /**
  * 

Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-03-09 Thread Nikolay Borisov


On  9.03.2018 21:03, Martin Svec wrote:
> Being curious, how do you know that my global rsv is almost unused? During 
> backups, I see that
> the usage is sometimes very close to the 512 MiB limit. I tried to increase 
> the limit to 1 GiB but it
> didn't help.

Well according to your output:
 Global reserve:  512.00MiB  (used: 0.00B)

In the end it might turn out that you are indeed hitting enospc because
you can't free enough in the critical section of the transaction.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ongoing Btrfs stability issues

2018-03-09 Thread Alex Adriaanse
On Mar 9, 2018, at 3:54 AM, Nikolay Borisov  wrote:
> 
>> Sorry, I clearly missed that one. I have applied the patch you referenced 
>> and rebooted the VM in question. This morning we had another FS failure on 
>> the same machine that caused it to go into readonly mode. This happened 
>> after that device was experiencing 100% I/O utilization for some time. No 
>> balance was running at the time; last balance finished about 6 hours prior 
>> to the error.
>> 
>> Kernel messages:
>> [211238.262683] use_block_rsv: 163 callbacks suppressed
>> [211238.262683] BTRFS: block rsv returned -28
>> [211238.266718] [ cut here ]
>> [211238.270462] WARNING: CPU: 0 PID: 391 at fs/btrfs/extent-tree.c:8463 
>> btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
>> [211238.277203] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE 
>> nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo 
>> iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype 
>> iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic 
>> br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul 
>> ghash_clmulni_intel ppdev parport_pc intel_rapl_perf parport serio_raw evdev 
>> ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash 
>> raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm 
>> drm_kms_helper aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse 
>> drm ena scsi_mod i2c_piix4 button
>> [211238.319618] CPU: 0 PID: 391 Comm: btrfs-transacti Tainted: GW
>>4.14.13 #3
>> [211238.325479] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
>> [211238.330742] task: 9cb43abb70c0 task.stack: b234c3b58000
>> [211238.335575] RIP: 0010:btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
>> [211238.340454] RSP: 0018:b234c3b5b958 EFLAGS: 00010282
>> [211238.344782] RAX: 001d RBX: 9cb43bdea128 RCX: 
>> 
>> [211238.350562] RDX:  RSI: 9cb440a166f8 RDI: 
>> 9cb440a166f8
>> [211238.356066] RBP: 4000 R08: 0001 R09: 
>> 7d81
>> [211238.361649] R10: 0001 R11: 7d81 R12: 
>> 9cb43bdea000
>> [211238.367304] R13: 9cb437f2c800 R14: 0001 R15: 
>> ffe4
>> [211238.372658] FS:  () GS:9cb440a0() 
>> knlGS:
>> [211238.379048] CS:  0010 DS:  ES:  CR0: 80050033
>> [211238.384681] CR2: 7f90a6677000 CR3: 0003cea0a006 CR4: 
>> 001606f0
>> [211238.391380] DR0:  DR1:  DR2: 
>> 
>> [211238.398050] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [211238.404730] Call Trace:
>> [211238.407880]  __btrfs_cow_block+0x125/0x5c0 [btrfs]
>> [211238.412455]  btrfs_cow_block+0xcb/0x1b0 [btrfs]
>> [211238.416292]  btrfs_search_slot+0x1fd/0x9e0 [btrfs]
>> [211238.420630]  lookup_inline_extent_backref+0x105/0x610 [btrfs]
>> [211238.425215]  __btrfs_free_extent.isra.61+0xf5/0xd30 [btrfs]
>> [211238.429663]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
>> [211238.434077]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
>> [211238.438541]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
>> [211238.442899]  ? remove_wait_queue+0x60/0x60
>> [211238.446503]  transaction_kthread+0x195/0x1b0 [btrfs]
>> [211238.450578]  kthread+0xfc/0x130
>> [211238.453924]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
>> [211238.458381]  ? kthread_create_on_node+0x70/0x70
>> [211238.462225]  ? do_group_exit+0x3a/0xa0
>> [211238.465586]  ret_from_fork+0x1f/0x30
>> [211238.468814] Code: ff 48 c7 c6 28 97 58 c0 48 c7 c7 a0 e1 5d c0 e8 0c d0 
>> f7 d5 85 c0 0f 84 1c fd ff ff 44 89 fe 48 c7 c7 58 0c 59 c0 e8 70 2f 9e d5 
>> <0f> ff e9 06 fd ff ff 4c 63 e8 31 d2 48 89 ee 48 89 df e8 4e eb
>> [211238.482366] ---[ end trace 48dd1ab4e2e46f6e ]---
>> [211238.486524] BTRFS info (device xvdc): space_info 4 has 
>> 18446744073258958848 free, is not full
>> [211238.493014] BTRFS info (device xvdc): space_info total=10737418240, 
>> used=7828127744, pinned=2128166912, reserved=243367936, may_use=988282880, 
>> readonly=65536
> 
> Ok so the numbers here are helpful, they show that we have enough space
> to allocate a chunk. I've also looked at the logic in 4.14.13 and all
> the necessary patches are there. Unfortunately none of this matters due
> to the fact that reserve_metadata_bytes is being called with
> BTRFS_RESERVE_NO_FLUSH from use_block_rsv, meaning the code won't make
> any effort to flush anything at all.
> 
> Can you tell again what the workload is - is it some sort of a database,
> constantly writing to its files?

Yes, we have PostgreSQL databases running these VMs that put a heavy I/O load 
on these machines. We also have snapshots being deleted and created every 15 
minutes. Looking at historical atop data for the two most recent crashes:

1. Right 

Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-03-09 Thread Martin Svec
Dne 9.3.2018 v 17:36 Nikolay Borisov napsal(a):
>
> On 23.02.2018 16:28, Martin Svec wrote:
>> Hello,
>>
>> we have a btrfs-based backup system using btrfs snapshots and rsync. 
>> Sometimes,
>> we hit ENOSPC bug and the filesystem is remounted read-only. However, 
>> there's 
>> still plenty of unallocated space according to "btrfs fi usage". So I think 
>> this
>> isn't another edge condition when btrfs runs out of space due to fragmented 
>> chunks,
>> but a bug in disk space allocation code. It suffices to umount the 
>> filesystem and
>> remount it back and it works fine again. The frequency of ENOSPC seems to be
>> dependent on metadata chunks usage. When there's a lot of free space in 
>> existing
>> metadata chunks, the bug doesn't happen for months. If most metadata chunks 
>> are
>> above ~98%, we hit the bug every few days. Below are details regarding the 
>> backup
>> server and btrfs.
>>
>> The backup works as follows: 
>>
>>   * Every night, we create a btrfs snapshot on the backup server and rsync 
>> data
>> from a production server into it. This snapshot is then marked read-only 
>> and
>> will be used as a base subvolume for the next backup snapshot.
>>   * Every day, expired snapshots are removed and their space is freed. 
>> Cleanup
>> is scheduled in such a way that it doesn't interfere with the backup 
>> window.
>>   * Multiple production servers are backed up in parallel to one backup 
>> server.
>>   * The backed up servers are mostly webhosting servers and mail servers, 
>> i.e.
>> hundreds of billions of small files. (Yes, we push btrfs to the limits 
>> :-))
>>   * Backup server contains ~1080 snapshots, Zlib compression is enabled.
>>   * Rsync is configured to use whole file copying.
>>
>> System configuration:
>>
>> Debian Stretch, vanilla stable 4.14.20 kernel with one custom btrfs patch 
>> (see below) and
>> Nikolay's patch 1b816c23e9 (btrfs: Add enospc_debug printing in 
>> metadata_reserve_bytes)
>>
>> btrfs mount options: 
>> noatime,compress=zlib,enospc_debug,space_cache=v2,commit=15
>>
>> $ btrfs fi df /backup:
>>
>> Data, single: total=28.05TiB, used=26.37TiB
>> System, single: total=32.00MiB, used=3.53MiB
>> Metadata, single: total=255.00GiB, used=250.73GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> $ btrfs fi show /backup:
>>
>> Label: none  uuid: a52501a9-651c-4712-a76b-7b4238cfff63
>> Total devices 2 FS bytes used 26.62TiB
>> devid1 size 416.62GiB used 255.03GiB path /dev/sdb
>> devid2 size 36.38TiB used 28.05TiB path /dev/sdc
>>
>> $ btrfs fi usage /backup:
>>
>> Overall:
>> Device size:  36.79TiB
>> Device allocated: 28.30TiB
>> Device unallocated:8.49TiB
>> Device missing:  0.00B
>> Used: 26.62TiB
>> Free (estimated): 10.17TiB  (min: 10.17TiB)
>> Data ratio:   1.00
>> Metadata ratio:   1.00
>> Global reserve:  512.00MiB  (used: 0.00B)
>>
>> Data,single: Size:28.05TiB, Used:26.37TiB
>>/dev/sdc   28.05TiB
>>
>> Metadata,single: Size:255.00GiB, Used:250.73GiB
>>/dev/sdb  255.00GiB
>>
>> System,single: Size:32.00MiB, Used:3.53MiB
>>/dev/sdb   32.00MiB
>>
>> Unallocated:
>>/dev/sdb  161.59GiB
>>/dev/sdc8.33TiB
>>
>> Btrfs filesystem uses two logical drives in single mode, backed by
>> hardware RAID controller PERC H710; /dev/sdb is HW RAID1 consisting
>> of two SATA SSDs and /dev/sdc is HW RAID6 SATA volume.
>>
>> Please note that we have a simple custom patch in btrfs which ensures
>> that metadata chunks are allocated preferably on SSD volume and data
>> chunks are allocated only on SATA volume. The patch slightly modifies
>> __btrfs_alloc_chunk() so that its loop over devices ignores rotating
>> devices when a metadata chunk is requested and vice versa. However, 
>> I'm quite sure that this patch doesn't cause the reported bug because
>> we log every call of the modified code and there're no __btrfs_alloc_chunk()
>> calls when ENOSPC is triggered. Moreover, we observed the same bug before
>> we developed the patch. (IIRC, Chris Mason mentioned that they work on
>> a similar feature in facebook, but I've found no official patches yet.)
>>
>> Dmesg dump:
>>
>> [285167.750763] use_block_rsv: 62468 callbacks suppressed
>> [285167.750764] BTRFS: block rsv returned -28
>> [285167.750789] [ cut here ]
>> [285167.750822] WARNING: CPU: 5 PID: 443 at fs/btrfs/extent-tree.c:8463 
>> btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
>> [285167.750823] Modules linked in: binfmt_misc xt_comment xt_tcpudp 
>> iptable_filter nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack iptable_raw 
>> ip6table_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 
>> nf_nat nf_conntr
>> [285167.750853]  zstd_compress xxhash raid6_pq sd_mod crc32c_intel psmouse 
>> uhci_hcd 

Re: [PATCH v2 06/10] btrfs-progs: kernel-lib: Port kernel sort() to btrfs-progs

2018-03-09 Thread David Sterba
On Thu, Feb 22, 2018 at 09:43:37AM +0800, Qu Wenruo wrote:
> 
> 
> On 2018年02月21日 23:40, David Sterba wrote:
> > On Fri, Feb 09, 2018 at 03:44:24PM +0800, Qu Wenruo wrote:
> >> Used by later btrfs_alloc_chunk() rework.
> > 
> > We have the libc provided qsort, so please don't pull another
> > implementation but rather add a wrapper.
> 
> Thanks for pointing this out.
> 
> Another 100 lines saved.
> 
> > 
> >> --- /dev/null
> >> +++ b/kernel-lib/sort.c
> >> @@ -0,0 +1,104 @@
> >> +/*
> >> + * taken from linux kernel lib/sort.c, removed kernel config code and 
> >> adapted
> >> + * for btrfsprogs
> >> + */
> >> +
> >> +#include "sort.h"
> >> +
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +/*
> >> + * A fast, small, non-recursive O(nlog n) sort for the Linux kernel
> >> + *
> >> + * Jan 23 2005  Matt Mackall 
> >> + */
> > 
> > When taking files from kernel: take the file as is and insert notes for
> > btrfs-progs below the comments. The SPDX line must be first.
> 
> Will keep that in mind.
> 
> Although quite a lot of kernel-libs doesn't follow this behavior.
> 
> Maybe we should put some README into kernel-libs/?

This should be a part of development documentation, it's in
Documentation, so far incomplete but adding the 'how to sync shared
kernel code' would be welcome.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Inconsistence between sender and receiver

2018-03-09 Thread David Sterba
On Fri, Mar 09, 2018 at 09:45:30AM +0300, Andrei Borzenkov wrote:
> 09.03.2018 08:38, Liu Bo пишет:
> > On Thu, Mar 08, 2018 at 09:15:50AM +0300, Andrei Borzenkov wrote:
> >> 07.03.2018 21:49, Liu Bo пишет:
> >>> Hi,
> >>>
> >>> In the following steps[1], if  on receiver side has got
> >>> changed via 'btrfs property set', then after doing incremental
> >>> updates, receiver gets a different snapshot from what sender has sent.
> >>>
> >>> The reason behind it is that there is no change about file 'foo' in
> >>> the send stream, such that receiver simply creates a snapshot of
> >>>  on its side with nothing to apply from the send stream.
> >>>
> >>> A possible way to avoid this is to check rtransid and ctranid of
> >>>  on receiver side, but I'm not very sure whether the current
> >>> behavior is made deliberately, does anyone have an idea? 
> >>>
> >>> Thanks,
> >>>
> >>> -liubo
> >>>
> >>> [1]:
> >>> $ btrfs sub create /mnt/send/sub
> >>> $ touch /mnt/send/sub/foo
> >>> $ btrfs sub snap -r /mnt/send/sub /mnt/send/parent
> >>>
> >>> # send parent out
> >>> $ btrfs send /mnt/send/parent | btrfs receive /mnt/recv/
> >>>
> >>> # change parent and file under it
> >>> $ btrfs property set -t subvol /mnt/recv/parent ro false
> >>
> >> Is removing the ability to modify read-only property an option? What are
> >> use cases for it? What can it do that "btrfs sub snap read-only
> >> writable" cannot?
> >>
> > 
> > Tbh, I don't know any usecase for that, I just wanted to toggle off
> > the ro bit in order to change something inside .
> > 
> >>> $ truncate -s 4096 /mnt/recv/parent/foo
> >>>
> >>> $ btrfs sub snap -r /mnt/send/sub /mnt/send/update
> >>> $ btrfs send -p /mnt/send/parent /mnt/send/update | btrfs receive 
> >>> /mnt/recv
> >>>
> >>
> >> This should fail right away because /mnt/send/parent is not read-only.
> >> If it does not, this is really a bug.
> >>
> > 
> > It's not '/mnt/send/parent', but '/mnt/recv/parent' got changed.
> > 
> 
> Ah, sorry.
> 
> What happened to this patch which clears received_uuid when ro is
> flipped off?
> 
> https://patchwork.kernel.org/patch/9986521/

There are still things to fix:

"Shouldn't we also wipe the other related fields here, like stime, rtime,
stransid, rtransid?"
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to replace a failed drive in btrfs RAID 1 filesystem

2018-03-09 Thread Austin S. Hemmelgarn

On 2018-03-09 11:53, Paul Richards wrote:

Fantastic response!  Thank you.

I haven’t investigated how broken the failed drive is, I just shutdown 
as soon as I noticed.


The 3 drives were 8, 8 and 2 TB.  The 2TB one failed and I’m replacing 
it with a new 8TB.  So the new drive is indeed larger.  If I do a 
“replace” I’ll end up with the same block distribution as before, so 
would likely want to balance afterwards.
Yes, you probably do, but you'll also need to resize the device first 
(which I forgot to mention in my reply), as replace doesn't expand that 
part of the volume to fill the new device.


I think, but I’ll need to confirm, that I have enough free space to do a 
mount degraded, delete, remount non-degraded again, then add, and 
rebalance.  This will leave me in degraded mode for the shortest time if 
my understanding is correct.
Assuming you can fit all the data on the two 8TB drives, then yes this 
will result int he shortest amount of time running degraded (although, 
if the failed drive is mostly working, you may not need to mount 
degraded at all to do this), though keep in mind that this will also 
result in significant load on the other disks and will give you degraded 
performance for the longest amount of time.


Thanks again for your notes, they should be on the wiki..  :)
I've been meaning to add it for a while actually, I just haven't gotten 
around to it yet.




On Fri, 9 Mar 2018 at 16:43, Austin S. Hemmelgarn > wrote:


On 2018-03-09 11:02, Paul Richards wrote:
 > Hello there,
 >
 > I have a 3 disk btrfs RAID 1 filesystem, with a single failed drive.
 > Before I attempt any recovery I’d like to ask what is the recommended
 > approach?  (The wiki docs suggest consulting here before attempting
 > recovery[1].)
 >
 > The system is powered down currently and a replacement drive is being
 > delivered soon.
 >
 > Should I use “replace”, or “add” and “delete”?
 >
 > Once replaced should I rebalance and/or scrub?
 >
 > I believe that the recovery may involve mounting in degraded
mode.  If
 > I do this, how do I later get out of degraded mode, or if it’s
 > automatic how do i determine when I’m out of degraded mode?
 >
It won't automatically mount degraded, you either have to explicitly ask
it to, or you have to have an option to do so in your default mount
options for the volume in /etc/fstab (which is dangerous for multiple
reasons).

Now, as to what the best way to go about this is, there are three things
to consider:

1. Is the failed disk still usable enough that you can get good data off
of it in a reasonable amount of time?  If you're replacing the disk
because of a lot of failed sectors, you can still probably get data off
of it, while something like a head crash isn't worth trying to get data
back.
2. Do you have enough room in the system itself to add another disk
without removing one?
3. Is the replacement disk at least as big as the failed disk?

If the answer to all three is yes, then just put in the new disk, mount
the volume normally (you don't need to mount it degraded if the failed
disk is working this well), and use `btrfs replace` to move the data.
This is the most efficient option in terms of both time and is also
generally the safest (and I personally always over-spec drive-bays in
systems we build where I work specifically so that this approach can be
used).

If the answer to the third question is no, put in the new disk (removing
the failed one first if the answer to the second question is no), mount
the volume (mount it degraded if one of the first two questions is no,
normally otherwise), then add the new disk to the volume with `btrfs
device add` and remove the old one with `btrfs device delete` (using the
'missing' option if you had to remove the failed disk).  This is needed
because the replace operation requires the new device to be at least as
big as the old one.

If the answer to either one or two is no but the answer to three is yes,
pull out the failed disk, put in a new one, mount the volume degraded,
and use `btrfs replace` as well (you will need to specify the device ID
for the now missing failed disk, which you can find by calling `btrfs
filesystem show` on the volume).  In the event that the replace
operation refuses to run in this case, instead add the new disk to the
volume with `btrfs device add` and then run `btrfs device delete
missing` on the volume.

If you follow any of the above procedures, you don't need to balance
(the replace operation is equivalent to a block level copy and will
result in data being distributed exactly the same as it was before,
while the delete operation is a special type of balance), and you
generally don't need to scrub 

Re: [PATCH 0/3] btrfs-progs: Split original mode check to its own

2018-03-09 Thread David Sterba
On Tue, Feb 06, 2018 at 03:02:20PM +0800, Qu Wenruo wrote:
> This time, there are 2 patches too large to reach mail list, so please
> fetch the whole patchset from github as usual:
> https://github.com/adam900710/btrfs-progs/tree/split_check_part2

I missed this patch series, sorry. The changes look good, there
are unfortunatelly merge conflicts due to the asan/ubsan fixes that
should be trivial to resolve. Can you please rebase the branch and let
mek now? Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] btrfs-progs: Limit inline extent below page size

2018-03-09 Thread David Sterba
On Thu, Mar 01, 2018 at 10:47:44AM +0800, Qu Wenruo wrote:
> Kernel doesn't support to drop extent inside an inlined extent.
> And kernel tends to limit inline extent just below sectorsize, so also
> limit it in btrfs-progs.
> 
> This fixes unexpected -EOPNOTSUPP error from __btrfs_drop_extents() on
> converted btrfs.
> 
> Fixes: 806528b8755f ("Add Yan Zheng's ext3->btrfs conversion program")
> Reported-by: Peter Y. Chuang 
> Signed-off-by: Qu Wenruo 

This patch breaks test fsck/020-extent-ref-cases for image
inline_regular_coexist.img .
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to replace a failed drive in btrfs RAID 1 filesystem

2018-03-09 Thread Austin S. Hemmelgarn

On 2018-03-09 11:02, Paul Richards wrote:

Hello there,

I have a 3 disk btrfs RAID 1 filesystem, with a single failed drive.
Before I attempt any recovery I’d like to ask what is the recommended
approach?  (The wiki docs suggest consulting here before attempting
recovery[1].)

The system is powered down currently and a replacement drive is being
delivered soon.

Should I use “replace”, or “add” and “delete”?

Once replaced should I rebalance and/or scrub?

I believe that the recovery may involve mounting in degraded mode.  If
I do this, how do I later get out of degraded mode, or if it’s
automatic how do i determine when I’m out of degraded mode?

It won't automatically mount degraded, you either have to explicitly ask 
it to, or you have to have an option to do so in your default mount 
options for the volume in /etc/fstab (which is dangerous for multiple 
reasons).


Now, as to what the best way to go about this is, there are three things 
to consider:


1. Is the failed disk still usable enough that you can get good data off 
of it in a reasonable amount of time?  If you're replacing the disk 
because of a lot of failed sectors, you can still probably get data off 
of it, while something like a head crash isn't worth trying to get data 
back.
2. Do you have enough room in the system itself to add another disk 
without removing one?

3. Is the replacement disk at least as big as the failed disk?

If the answer to all three is yes, then just put in the new disk, mount 
the volume normally (you don't need to mount it degraded if the failed 
disk is working this well), and use `btrfs replace` to move the data. 
This is the most efficient option in terms of both time and is also 
generally the safest (and I personally always over-spec drive-bays in 
systems we build where I work specifically so that this approach can be 
used).


If the answer to the third question is no, put in the new disk (removing 
the failed one first if the answer to the second question is no), mount 
the volume (mount it degraded if one of the first two questions is no, 
normally otherwise), then add the new disk to the volume with `btrfs 
device add` and remove the old one with `btrfs device delete` (using the 
'missing' option if you had to remove the failed disk).  This is needed 
because the replace operation requires the new device to be at least as 
big as the old one.


If the answer to either one or two is no but the answer to three is yes, 
pull out the failed disk, put in a new one, mount the volume degraded, 
and use `btrfs replace` as well (you will need to specify the device ID 
for the now missing failed disk, which you can find by calling `btrfs 
filesystem show` on the volume).  In the event that the replace 
operation refuses to run in this case, instead add the new disk to the 
volume with `btrfs device add` and then run `btrfs device delete 
missing` on the volume.


If you follow any of the above procedures, you don't need to balance 
(the replace operation is equivalent to a block level copy and will 
result in data being distributed exactly the same as it was before, 
while the delete operation is a special type of balance), and you 
generally don't need to scrub the volume either (though it may still be 
a good idea).  As far as getting back from degraded mode, you can just 
remount the volume to do so, though I would generally suggest rebooting.


Note that there are three other possible approaches to consider as well:

1. If you can't immediately get a new disk _and_ all the data will fit 
on the other two disks, use `btrfs device delete` to remove the failed 
disk anyway, and run with just the two until you can get a new disk. 
This is exponentially safer than running the volume degraded until you 
get a new disk, and is the only case you realistically should delete a 
device before adding the new one.  Make sure to balance the volume after 
adding the new device.
2. Depending on the situation, it may be faster to just recreate the 
whole volume from scratch using a backup than it is to try to repair it. 
 This is actually the absolute safest method of handling this 
situation, as it makes sure that nothing from the old volume with the 
failed disk causes problems in the future.
3. If you don't have a backup, but have some temporary storage space 
that will fit all the data from the volume, you could also use `btrfs 
restore` to extract files from the old volume to temporary storage, 
recreate the volume, and copy the data back in from the temporary storage.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-03-09 Thread Nikolay Borisov


On 23.02.2018 16:28, Martin Svec wrote:
> Hello,
> 
> we have a btrfs-based backup system using btrfs snapshots and rsync. 
> Sometimes,
> we hit ENOSPC bug and the filesystem is remounted read-only. However, there's 
> still plenty of unallocated space according to "btrfs fi usage". So I think 
> this
> isn't another edge condition when btrfs runs out of space due to fragmented 
> chunks,
> but a bug in disk space allocation code. It suffices to umount the filesystem 
> and
> remount it back and it works fine again. The frequency of ENOSPC seems to be
> dependent on metadata chunks usage. When there's a lot of free space in 
> existing
> metadata chunks, the bug doesn't happen for months. If most metadata chunks 
> are
> above ~98%, we hit the bug every few days. Below are details regarding the 
> backup
> server and btrfs.
> 
> The backup works as follows: 
> 
>   * Every night, we create a btrfs snapshot on the backup server and rsync 
> data
> from a production server into it. This snapshot is then marked read-only 
> and
> will be used as a base subvolume for the next backup snapshot.
>   * Every day, expired snapshots are removed and their space is freed. Cleanup
> is scheduled in such a way that it doesn't interfere with the backup 
> window.
>   * Multiple production servers are backed up in parallel to one backup 
> server.
>   * The backed up servers are mostly webhosting servers and mail servers, i.e.
> hundreds of billions of small files. (Yes, we push btrfs to the limits 
> :-))
>   * Backup server contains ~1080 snapshots, Zlib compression is enabled.
>   * Rsync is configured to use whole file copying.
> 
> System configuration:
> 
> Debian Stretch, vanilla stable 4.14.20 kernel with one custom btrfs patch 
> (see below) and
> Nikolay's patch 1b816c23e9 (btrfs: Add enospc_debug printing in 
> metadata_reserve_bytes)
> 
> btrfs mount options: 
> noatime,compress=zlib,enospc_debug,space_cache=v2,commit=15
> 
> $ btrfs fi df /backup:
> 
> Data, single: total=28.05TiB, used=26.37TiB
> System, single: total=32.00MiB, used=3.53MiB
> Metadata, single: total=255.00GiB, used=250.73GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> $ btrfs fi show /backup:
> 
> Label: none  uuid: a52501a9-651c-4712-a76b-7b4238cfff63
> Total devices 2 FS bytes used 26.62TiB
> devid1 size 416.62GiB used 255.03GiB path /dev/sdb
> devid2 size 36.38TiB used 28.05TiB path /dev/sdc
> 
> $ btrfs fi usage /backup:
> 
> Overall:
> Device size:  36.79TiB
> Device allocated: 28.30TiB
> Device unallocated:8.49TiB
> Device missing:  0.00B
> Used: 26.62TiB
> Free (estimated): 10.17TiB  (min: 10.17TiB)
> Data ratio:   1.00
> Metadata ratio:   1.00
> Global reserve:  512.00MiB  (used: 0.00B)
> 
> Data,single: Size:28.05TiB, Used:26.37TiB
>/dev/sdc   28.05TiB
> 
> Metadata,single: Size:255.00GiB, Used:250.73GiB
>/dev/sdb  255.00GiB
> 
> System,single: Size:32.00MiB, Used:3.53MiB
>/dev/sdb   32.00MiB
> 
> Unallocated:
>/dev/sdb  161.59GiB
>/dev/sdc8.33TiB
> 
> Btrfs filesystem uses two logical drives in single mode, backed by
> hardware RAID controller PERC H710; /dev/sdb is HW RAID1 consisting
> of two SATA SSDs and /dev/sdc is HW RAID6 SATA volume.
> 
> Please note that we have a simple custom patch in btrfs which ensures
> that metadata chunks are allocated preferably on SSD volume and data
> chunks are allocated only on SATA volume. The patch slightly modifies
> __btrfs_alloc_chunk() so that its loop over devices ignores rotating
> devices when a metadata chunk is requested and vice versa. However, 
> I'm quite sure that this patch doesn't cause the reported bug because
> we log every call of the modified code and there're no __btrfs_alloc_chunk()
> calls when ENOSPC is triggered. Moreover, we observed the same bug before
> we developed the patch. (IIRC, Chris Mason mentioned that they work on
> a similar feature in facebook, but I've found no official patches yet.)
> 
> Dmesg dump:
> 
> [285167.750763] use_block_rsv: 62468 callbacks suppressed
> [285167.750764] BTRFS: block rsv returned -28
> [285167.750789] [ cut here ]
> [285167.750822] WARNING: CPU: 5 PID: 443 at fs/btrfs/extent-tree.c:8463 
> btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
> [285167.750823] Modules linked in: binfmt_misc xt_comment xt_tcpudp 
> iptable_filter nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack iptable_raw 
> ip6table_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 
> nf_nat nf_conntr
> [285167.750853]  zstd_compress xxhash raid6_pq sd_mod crc32c_intel psmouse 
> uhci_hcd ehci_pci ehci_hcd megaraid_sas usbcore scsi_mod bnx2
> [285167.750861] CPU: 5 PID: 443 Comm: btrfs-transacti Tainted: GW I   
>   4.14.20-znr1+ #69
> 

Re: [PATCH] btrfs-progs: dump-tree: add degraded option

2018-03-09 Thread David Sterba
On Thu, Feb 22, 2018 at 10:00:11PM +0800, Anand Jain wrote:
> btrfs inspect dump-tree cli picks the disk with the largest generation
> to read the root tree, even when all the devices were not provided in
> the cli. But in 2 disks RAID1 you may need to know what's in the disks
> individually, so this option -x | --degraded indicates to use only the
> given disk to dump.
> 
> Signed-off-by: Anand Jain 
> ---
>  cmds-inspect-dump-tree.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c
> index df44bb635c9c..587b6081df0c 100644
> --- a/cmds-inspect-dump-tree.c
> +++ b/cmds-inspect-dump-tree.c
> @@ -198,6 +198,7 @@ const char * const cmd_inspect_dump_tree_usage[] = {
>   "-u|--uuid  print only the uuid tree",
>   "-b|--block  print info from the specified block only",
>   "-t|--tree print only tree with the given id (string or 
> number)",
> + "-x|--degraded  For RAID1, use the disk in the arg, do not scan 
> for disks",

I find this confusing, degraded for kernel means 'mount with anything
you have' while here it means 'use exactly that disk'.

Something that would reflect the "don't scan" would be IMHO better, but
otherwise the goal of the patch makes sense.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/3] btrfs-progs: check/lowmem: Fix the incorrect error message of check_extent_data_item

2018-03-09 Thread David Sterba
On Thu, Mar 08, 2018 at 04:13:27PM +0800, Qu Wenruo wrote:
> 
> 
> On 2018年02月28日 18:13, Lu Fengqi wrote:
> > Instead of the disk_bytenr and disk_num_bytes of the extent_item which the
> > file extent references, we should output the objectid and offset of the
> > file extent. And the leaf may be shared by the file trees, we should print
> > the objectid of the root and the owner of the leaf.
> > 
> > Fixes: b0d360b541f0 ("btrfs-progs: check: introduce function to check data 
> > backref in extent tree")
> > Signed-off-by: Lu Fengqi 
> > ---
> > V2: Output the objectid of the root and the owner of the leaf.
> > 
> >  check/mode-lowmem.c | 11 ++-
> >  1 file changed, 6 insertions(+), 5 deletions(-)
> > 
> > diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
> > index 62bcf3d2e126..f37b1b2c1571 100644
> > --- a/check/mode-lowmem.c
> > +++ b/check/mode-lowmem.c
> > @@ -2631,9 +2631,9 @@ static int check_extent_data_item(struct btrfs_root 
> > *root,
> >  
> > if (!(extent_flags & BTRFS_EXTENT_FLAG_DATA)) {
> > error(
> > -   "extent[%llu %llu] backref type mismatch, wanted bit: %llx",
> > -   disk_bytenr, disk_num_bytes,
> > -   BTRFS_EXTENT_FLAG_DATA);
> > +"file extent[%llu %llu] root %llu owner %llu backref type mismatch, wanted 
> > bit: %llx",
> > +   fi_key.objectid, fi_key.offset, root->objectid, owner,
> 
> Indeed this is much easier to identify the problem.
> 
> Reviewed-by: Qu Wenruo 

1-3 applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


How to replace a failed drive in btrfs RAID 1 filesystem

2018-03-09 Thread Paul Richards
Hello there,

I have a 3 disk btrfs RAID 1 filesystem, with a single failed drive.
Before I attempt any recovery I’d like to ask what is the recommended
approach?  (The wiki docs suggest consulting here before attempting
recovery[1].)

The system is powered down currently and a replacement drive is being
delivered soon.

Should I use “replace”, or “add” and “delete”?

Once replaced should I rebalance and/or scrub?

I believe that the recovery may involve mounting in degraded mode.  If
I do this, how do I later get out of degraded mode, or if it’s
automatic how do i determine when I’m out of degraded mode?


The system is running Ubuntu 16.04 LTS with kernel 4.4.0.



1: “it may be helpful to consult the mailing list of irc channel
before attempting recovery”. -
https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Replacing_failed_devices
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: dump-super: Don't verify csum if csum type or size is unknown

2018-03-09 Thread David Sterba
On Tue, Mar 06, 2018 at 10:16:51AM +0800, Qu Wenruo wrote:
> Reported-by: Ken Swenson 
> Signed-off-by: Qu Wenruo 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs-progs: free-space-cache: Enhance free space cache free space check

2018-03-09 Thread David Sterba
On Thu, Mar 08, 2018 at 03:02:31PM +0800, Qu Wenruo wrote:
> When we found free space difference between free space cache and block
> group item, we just discard this free space cache.
> 
> Normally such difference is caused by btrfs_reserve_extent() called by
> delalloc which is out of a transaction.
> And since all btrfs_release_extent() is called with a transaction, under
> heavy race free space cache can have less free space than block group
> item.
> 
> Normally kernel will detect such difference and just discard that cache.
> 
> However we must be more careful if free space cache has more free space
> cache, and if that happens, paried with above race one invalid free
> space cache can be loaded into kernel.
> 
> So if we find any free space cache who has more free space then block
> group item, we report it as an error other than ignoring it.
> 
> Signed-off-by: Qu Wenruo 
> ---
> v2:
>   Fix the timming of free space output.

I'll apply the v2 patch, thanks. The message can be updated with manual
page together.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs-progs: free-space-cache: Enhance free space cache free space check

2018-03-09 Thread David Sterba
On Fri, Mar 09, 2018 at 10:06:02AM +0200, Nikolay Borisov wrote:
> >>> +  * space cache has less free space, and both kernel just discard
> >>> +  * such cache. But if we find some case where free space cache
> >>> +  * has more free space, this means under certain case such
> >>> +  * cache can be loaded and cause double allocate.
> >>> +  *
> >>> +  * Detect such possibility here.
> >>> +  */
> >>> + if (diff > 0)
> >>> + error(
> >>> +"free space cache has more free space than block group item, this could 
> >>> leads to serious corruption, please contact btrfs developers");
> >>
> >> I'm not entirely happy with this message. So they will post to the
> >> mailing list saying something along the lines of "I got this message
> >> what do I do no, please help".  Better to output actionable data so that
> >> the user can post it immediately.
> > 
> > Unfortunately, this is already the situation we don't expect to see.
> > 
> > What we really need is to know this could happen, and if possible some
> > info about the situation.
> > There is not much actionable data here.
> 
> Fair enough, at the very least I think we should put information how to
> contact btrfs developers. So put the address of the mailing list, I
> don't think it's safe to assume people will be aware of it .

This should go to the manual page or help string, and reference it from
here.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-03-09 Thread Nikolay Borisov


On  9.03.2018 17:04, Martin Svec wrote:
> Nobody knows?
> 
> I'm particularly interested why debug space_info 4 shows negative (unsigned 
> 18446744072120172544)
> value as free metadata space, please see the original report. Is it a bug in 
> dump_space_info(), or
> metadata reservations can temporarily exceed the total space, or is it an 
> indication of a damaged
> filesystem? Also note that rebuilding free space cache doesn't help.

It's a bug in dump_space_info, mainly caused by the fact that when you
add may_use_bytes to the curently used space i.e. used + pinned +
may_use + reserved it usually is larger than total.

As I said in my reply to Alex Adriaanse (I assume you work on the same
system) what transpires is that you are running out of space in a
context which essentially doesn't allow to free any more space.


> 
> Thank you.
> 
> Martin
> 
> Dne 23.2.2018 v 15:28 Martin Svec napsal(a):
>> Hello,
>>
>> we have a btrfs-based backup system using btrfs snapshots and rsync. 
>> Sometimes,
>> we hit ENOSPC bug and the filesystem is remounted read-only. However, 
>> there's 
>> still plenty of unallocated space according to "btrfs fi usage". So I think 
>> this
>> isn't another edge condition when btrfs runs out of space due to fragmented 
>> chunks,
>> but a bug in disk space allocation code. It suffices to umount the 
>> filesystem and
>> remount it back and it works fine again. The frequency of ENOSPC seems to be
>> dependent on metadata chunks usage. When there's a lot of free space in 
>> existing
>> metadata chunks, the bug doesn't happen for months. If most metadata chunks 
>> are
>> above ~98%, we hit the bug every few days. Below are details regarding the 
>> backup
>> server and btrfs.
>>
>> The backup works as follows: 
>>
>>   * Every night, we create a btrfs snapshot on the backup server and rsync 
>> data
>> from a production server into it. This snapshot is then marked read-only 
>> and
>> will be used as a base subvolume for the next backup snapshot.
>>   * Every day, expired snapshots are removed and their space is freed. 
>> Cleanup
>> is scheduled in such a way that it doesn't interfere with the backup 
>> window.
>>   * Multiple production servers are backed up in parallel to one backup 
>> server.
>>   * The backed up servers are mostly webhosting servers and mail servers, 
>> i.e.
>> hundreds of billions of small files. (Yes, we push btrfs to the limits 
>> :-))
>>   * Backup server contains ~1080 snapshots, Zlib compression is enabled.
>>   * Rsync is configured to use whole file copying.
>>
>> System configuration:
>>
>> Debian Stretch, vanilla stable 4.14.20 kernel with one custom btrfs patch 
>> (see below) and
>> Nikolay's patch 1b816c23e9 (btrfs: Add enospc_debug printing in 
>> metadata_reserve_bytes)
>>
>> btrfs mount options: 
>> noatime,compress=zlib,enospc_debug,space_cache=v2,commit=15
>>
>> $ btrfs fi df /backup:
>>
>> Data, single: total=28.05TiB, used=26.37TiB
>> System, single: total=32.00MiB, used=3.53MiB
>> Metadata, single: total=255.00GiB, used=250.73GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> $ btrfs fi show /backup:
>>
>> Label: none  uuid: a52501a9-651c-4712-a76b-7b4238cfff63
>> Total devices 2 FS bytes used 26.62TiB
>> devid1 size 416.62GiB used 255.03GiB path /dev/sdb
>> devid2 size 36.38TiB used 28.05TiB path /dev/sdc
>>
>> $ btrfs fi usage /backup:
>>
>> Overall:
>> Device size:  36.79TiB
>> Device allocated: 28.30TiB
>> Device unallocated:8.49TiB
>> Device missing:  0.00B
>> Used: 26.62TiB
>> Free (estimated): 10.17TiB  (min: 10.17TiB)
>> Data ratio:   1.00
>> Metadata ratio:   1.00
>> Global reserve:  512.00MiB  (used: 0.00B)
>>
>> Data,single: Size:28.05TiB, Used:26.37TiB
>>/dev/sdc   28.05TiB
>>
>> Metadata,single: Size:255.00GiB, Used:250.73GiB
>>/dev/sdb  255.00GiB
>>
>> System,single: Size:32.00MiB, Used:3.53MiB
>>/dev/sdb   32.00MiB
>>
>> Unallocated:
>>/dev/sdb  161.59GiB
>>/dev/sdc8.33TiB
>>
>> Btrfs filesystem uses two logical drives in single mode, backed by
>> hardware RAID controller PERC H710; /dev/sdb is HW RAID1 consisting
>> of two SATA SSDs and /dev/sdc is HW RAID6 SATA volume.
>>
>> Please note that we have a simple custom patch in btrfs which ensures
>> that metadata chunks are allocated preferably on SSD volume and data
>> chunks are allocated only on SATA volume. The patch slightly modifies
>> __btrfs_alloc_chunk() so that its loop over devices ignores rotating
>> devices when a metadata chunk is requested and vice versa. However, 
>> I'm quite sure that this patch doesn't cause the reported bug because
>> we log every call of the modified code and there're no __btrfs_alloc_chunk()
>> calls when ENOSPC is triggered. Moreover, we 

Re: [PATCH v2] Btrfs: scrub: batch rebuild for raid56

2018-03-09 Thread David Sterba
On Wed, Mar 07, 2018 at 12:08:09PM -0700, Liu Bo wrote:
> In case of raid56, writes and rebuilds always take BTRFS_STRIPE_LEN(64K)
> as unit, however, scrub_extent() sets blocksize as unit, so rebuild
> process may be triggered on every block on a same stripe.
> 
> A typical example would be that when we're replacing a disappeared disk,
> all reads on the disks get -EIO, every block (size is 4K if blocksize is
> 4K) would go thru these,
> 
> scrub_handle_errored_block
>   scrub_recheck_block # re-read pages one by one
>   scrub_recheck_block # rebuild by calling raid56_parity_recover()
> page by page
> 
> Although with raid56 stripe cache most of reads during rebuild can be
> avoided, the parity recover calculation(xor or raid6 algorithms) needs to
> be done $(BTRFS_STRIPE_LEN / blocksize) times.
> 
> This makes it smarter by doing raid56 scrub/replace on stripe length.
> 
> Signed-off-by: Liu Bo 
> ---
> v2: - Place bio allocation in code statement.
> - Get rid of bio_set_op_attrs.
> - Add SOB.

Added to next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-03-09 Thread Martin Svec
Nobody knows?

I'm particularly interested why debug space_info 4 shows negative (unsigned 
18446744072120172544)
value as free metadata space, please see the original report. Is it a bug in 
dump_space_info(), or
metadata reservations can temporarily exceed the total space, or is it an 
indication of a damaged
filesystem? Also note that rebuilding free space cache doesn't help.

Thank you.

Martin

Dne 23.2.2018 v 15:28 Martin Svec napsal(a):
> Hello,
>
> we have a btrfs-based backup system using btrfs snapshots and rsync. 
> Sometimes,
> we hit ENOSPC bug and the filesystem is remounted read-only. However, there's 
> still plenty of unallocated space according to "btrfs fi usage". So I think 
> this
> isn't another edge condition when btrfs runs out of space due to fragmented 
> chunks,
> but a bug in disk space allocation code. It suffices to umount the filesystem 
> and
> remount it back and it works fine again. The frequency of ENOSPC seems to be
> dependent on metadata chunks usage. When there's a lot of free space in 
> existing
> metadata chunks, the bug doesn't happen for months. If most metadata chunks 
> are
> above ~98%, we hit the bug every few days. Below are details regarding the 
> backup
> server and btrfs.
>
> The backup works as follows: 
>
>   * Every night, we create a btrfs snapshot on the backup server and rsync 
> data
> from a production server into it. This snapshot is then marked read-only 
> and
> will be used as a base subvolume for the next backup snapshot.
>   * Every day, expired snapshots are removed and their space is freed. Cleanup
> is scheduled in such a way that it doesn't interfere with the backup 
> window.
>   * Multiple production servers are backed up in parallel to one backup 
> server.
>   * The backed up servers are mostly webhosting servers and mail servers, i.e.
> hundreds of billions of small files. (Yes, we push btrfs to the limits 
> :-))
>   * Backup server contains ~1080 snapshots, Zlib compression is enabled.
>   * Rsync is configured to use whole file copying.
>
> System configuration:
>
> Debian Stretch, vanilla stable 4.14.20 kernel with one custom btrfs patch 
> (see below) and
> Nikolay's patch 1b816c23e9 (btrfs: Add enospc_debug printing in 
> metadata_reserve_bytes)
>
> btrfs mount options: 
> noatime,compress=zlib,enospc_debug,space_cache=v2,commit=15
>
> $ btrfs fi df /backup:
>
> Data, single: total=28.05TiB, used=26.37TiB
> System, single: total=32.00MiB, used=3.53MiB
> Metadata, single: total=255.00GiB, used=250.73GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> $ btrfs fi show /backup:
>
> Label: none  uuid: a52501a9-651c-4712-a76b-7b4238cfff63
> Total devices 2 FS bytes used 26.62TiB
> devid1 size 416.62GiB used 255.03GiB path /dev/sdb
> devid2 size 36.38TiB used 28.05TiB path /dev/sdc
>
> $ btrfs fi usage /backup:
>
> Overall:
> Device size:  36.79TiB
> Device allocated: 28.30TiB
> Device unallocated:8.49TiB
> Device missing:  0.00B
> Used: 26.62TiB
> Free (estimated): 10.17TiB  (min: 10.17TiB)
> Data ratio:   1.00
> Metadata ratio:   1.00
> Global reserve:  512.00MiB  (used: 0.00B)
>
> Data,single: Size:28.05TiB, Used:26.37TiB
>/dev/sdc   28.05TiB
>
> Metadata,single: Size:255.00GiB, Used:250.73GiB
>/dev/sdb  255.00GiB
>
> System,single: Size:32.00MiB, Used:3.53MiB
>/dev/sdb   32.00MiB
>
> Unallocated:
>/dev/sdb  161.59GiB
>/dev/sdc8.33TiB
>
> Btrfs filesystem uses two logical drives in single mode, backed by
> hardware RAID controller PERC H710; /dev/sdb is HW RAID1 consisting
> of two SATA SSDs and /dev/sdc is HW RAID6 SATA volume.
>
> Please note that we have a simple custom patch in btrfs which ensures
> that metadata chunks are allocated preferably on SSD volume and data
> chunks are allocated only on SATA volume. The patch slightly modifies
> __btrfs_alloc_chunk() so that its loop over devices ignores rotating
> devices when a metadata chunk is requested and vice versa. However, 
> I'm quite sure that this patch doesn't cause the reported bug because
> we log every call of the modified code and there're no __btrfs_alloc_chunk()
> calls when ENOSPC is triggered. Moreover, we observed the same bug before
> we developed the patch. (IIRC, Chris Mason mentioned that they work on
> a similar feature in facebook, but I've found no official patches yet.)
>
> Dmesg dump:
>
> [285167.750763] use_block_rsv: 62468 callbacks suppressed
> [285167.750764] BTRFS: block rsv returned -28
> [285167.750789] [ cut here ]
> [285167.750822] WARNING: CPU: 5 PID: 443 at fs/btrfs/extent-tree.c:8463 
> btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
> [285167.750823] Modules linked in: binfmt_misc xt_comment xt_tcpudp 
> iptable_filter nf_conntrack_ipv6 

Re: dmesg flooded with "Very big device. Trying to use READ CAPACITY(16)" with 8TB HDDs

2018-03-09 Thread David Sterba
On Thu, Mar 08, 2018 at 12:18:04PM +0100, Menion wrote:
> Actually this path can be taken in few occurrency
> 
> 1) device probe, only when the device is plugged or detected the first time
> 2) revalidate_disk fops of block device
> 
> Is it possible that BTRFS every 5 minutes call the revalidate_disk?

An idea: udev or blkid cache refresh, triggers 'btrfs dev scan' that
calls blkdev_get_by_path that could in turn call the revalidation and
whatnot.

Alternatively you can patch the code and add a WARN_ON right after the
message, the stacktrace will tell for sure from where it gets triggered.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: scrub: batch rebuild for raid56

2018-03-09 Thread David Sterba
On Wed, Mar 07, 2018 at 05:22:08PM +0200, Nikolay Borisov wrote:
> 
> 
> On  7.03.2018 16:43, David Sterba wrote:
> > On Tue, Mar 06, 2018 at 11:22:21AM -0700, Liu Bo wrote:
> >> On Tue, Mar 06, 2018 at 11:47:47AM +0100, David Sterba wrote:
> >>> On Fri, Mar 02, 2018 at 04:10:37PM -0700, Liu Bo wrote:
>  In case of raid56, writes and rebuilds always take BTRFS_STRIPE_LEN(64K)
>  as unit, however, scrub_extent() sets blocksize as unit, so rebuild
>  process may be triggered on every block on a same stripe.
> 
>  A typical example would be that when we're replacing a disappeared disk,
>  all reads on the disks get -EIO, every block (size is 4K if blocksize is
>  4K) would go thru these,
> 
>  scrub_handle_errored_block
>    scrub_recheck_block # re-read pages one by one
>    scrub_recheck_block # rebuild by calling raid56_parity_recover()
>  page by page
> 
>  Although with raid56 stripe cache most of reads during rebuild can be
>  avoided, the parity recover calculation(xor or raid6 algorithms) needs to
>  be done $(BTRFS_STRIPE_LEN / blocksize) times.
> 
>  This makes it less stupid by doing raid56 scrub/replace on stripe length.
> >>>
> >>> missing s-o-b
> >>
> >> I'm surprised that checkpatch.pl didn't complain.
> 
> I have written a python script that can scrape the mailing list and run
> checkpatch (and any other software deemed appropriate) on posted patches
> and reply back with results. However, I haven't really activated it, I
> guess if people think there is merit in it I could hook it up to the
> mailing list :)

If we and checkpatch agree on the issues to report then yes, but I think
it would be confusing when we'd have to tell people to ignore some of
the warnings. I think we'd need to tune checkpatch for our needs, I
would not mind testing it locally. I'd start with catching typos in the
changelogs and comments, that's something that happens all the time and
can be automated. Next, implement AI that will try to understand the
changelog and tell people what's missing.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: sort and group mount option definitions

2018-03-09 Thread David Sterba
Sort mount options by the primary name, followed by the 'no-'
counterpart if it exists. Group the deprecated and debugging options.
Enum and token defintions are synced.

Signed-off-by: David Sterba 
---
 fs/btrfs/super.c | 139 ++-
 1 file changed, 86 insertions(+), 53 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 121b9d40ff8f..d41d5960ef4a 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -308,21 +308,50 @@ static void btrfs_put_super(struct super_block *sb)
 }
 
 enum {
-   Opt_degraded, Opt_subvol, Opt_subvolid, Opt_device, Opt_nodatasum,
-   Opt_nodatacow, Opt_max_inline, Opt_alloc_start, Opt_nobarrier, Opt_ssd,
-   Opt_nossd, Opt_ssd_spread, Opt_nossd_spread, Opt_thread_pool, Opt_noacl,
-   Opt_compress, Opt_compress_type, Opt_compress_force,
-   Opt_compress_force_type, Opt_notreelog, Opt_ratio, Opt_flushoncommit,
-   Opt_discard, Opt_space_cache, Opt_space_cache_version, Opt_clear_cache,
-   Opt_user_subvol_rm_allowed, Opt_enospc_debug, Opt_subvolrootid,
-   Opt_defrag, Opt_inode_cache, Opt_no_space_cache, Opt_recovery,
-   Opt_skip_balance, Opt_check_integrity,
+   Opt_acl, Opt_noacl,
+   Opt_clear_cache,
+   Opt_commit_interval,
+   Opt_compress,
+   Opt_compress_force,
+   Opt_compress_force_type,
+   Opt_compress_type,
+   Opt_degraded,
+   Opt_device,
+   Opt_fatal_errors,
+   Opt_flushoncommit, Opt_noflushoncommit,
+   Opt_inode_cache, Opt_noinode_cache,
+   Opt_max_inline,
+   Opt_barrier, Opt_nobarrier,
+   Opt_datacow, Opt_nodatacow,
+   Opt_datasum, Opt_nodatasum,
+   Opt_defrag, Opt_nodefrag,
+   Opt_discard, Opt_nodiscard,
+   Opt_nologreplay,
+   Opt_norecovery,
+   Opt_ratio,
+   Opt_rescan_uuid_tree,
+   Opt_skip_balance,
+   Opt_space_cache, Opt_no_space_cache,
+   Opt_space_cache_version,
+   Opt_ssd, Opt_nossd,
+   Opt_ssd_spread, Opt_nossd_spread,
+   Opt_subvol,
+   Opt_subvolid,
+   Opt_thread_pool,
+   Opt_treelog, Opt_notreelog,
+   Opt_usebackuproot,
+   Opt_user_subvol_rm_allowed,
+
+   /* Deprecated options */
+   Opt_alloc_start,
+   Opt_recovery,
+   Opt_subvolrootid,
+
+   /* Debugging options */
+   Opt_check_integrity,
Opt_check_integrity_including_extent_data,
-   Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_rescan_uuid_tree,
-   Opt_commit_interval, Opt_barrier, Opt_nodefrag, Opt_nodiscard,
-   Opt_noenospc_debug, Opt_noflushoncommit, Opt_acl, Opt_datacow,
-   Opt_datasum, Opt_treelog, Opt_noinode_cache, Opt_usebackuproot,
-   Opt_nologreplay, Opt_norecovery,
+   Opt_check_integrity_print_mask,
+   Opt_enospc_debug, Opt_noenospc_debug,
 #ifdef CONFIG_BTRFS_DEBUG
Opt_fragment_data, Opt_fragment_metadata, Opt_fragment_all,
 #endif
@@ -333,59 +362,63 @@ enum {
 };
 
 static const match_table_t tokens = {
-   {Opt_degraded, "degraded"},
-   {Opt_subvol, "subvol=%s"},
-   {Opt_subvolid, "subvolid=%s"},
-   {Opt_device, "device=%s"},
-   {Opt_nodatasum, "nodatasum"},
-   {Opt_datasum, "datasum"},
-   {Opt_nodatacow, "nodatacow"},
-   {Opt_datacow, "datacow"},
-   {Opt_nobarrier, "nobarrier"},
-   {Opt_barrier, "barrier"},
-   {Opt_max_inline, "max_inline=%s"},
-   {Opt_alloc_start, "alloc_start=%s"},/* deprecated */
-   {Opt_thread_pool, "thread_pool=%u"},
+   {Opt_acl, "acl"},
+   {Opt_noacl, "noacl"},
+   {Opt_clear_cache, "clear_cache"},
+   {Opt_commit_interval, "commit=%u"},
{Opt_compress, "compress"},
{Opt_compress_type, "compress=%s"},
{Opt_compress_force, "compress-force"},
{Opt_compress_force_type, "compress-force=%s"},
-   {Opt_ssd, "ssd"},
-   {Opt_ssd_spread, "ssd_spread"},
-   {Opt_nossd, "nossd"},
-   {Opt_nossd_spread, "nossd_spread"},
-   {Opt_acl, "acl"},
-   {Opt_noacl, "noacl"},
-   {Opt_notreelog, "notreelog"},
-   {Opt_treelog, "treelog"},
-   {Opt_nologreplay, "nologreplay"},
-   {Opt_norecovery, "norecovery"},
+   {Opt_degraded, "degraded"},
+   {Opt_device, "device=%s"},
+   {Opt_fatal_errors, "fatal_errors=%s"},
{Opt_flushoncommit, "flushoncommit"},
{Opt_noflushoncommit, "noflushoncommit"},
-   {Opt_ratio, "metadata_ratio=%u"},
+   {Opt_inode_cache, "inode_cache"},
+   {Opt_noinode_cache, "noinode_cache"},
+   {Opt_max_inline, "max_inline=%s"},
+   {Opt_barrier, "barrier"},
+   {Opt_nobarrier, "nobarrier"},
+   {Opt_datacow, "datacow"},
+   {Opt_nodatacow, "nodatacow"},
+   {Opt_datasum, "datasum"},
+   {Opt_nodatasum, "nodatasum"},
+   {Opt_defrag, "autodefrag"},
+   {Opt_nodefrag, "noautodefrag"},
{Opt_discard, "discard"},
{Opt_nodiscard, "nodiscard"},
+   

Re: [PATCH v2] btrfs: Add nossd_spread mount option

2018-03-09 Thread David Sterba
On Thu, Mar 08, 2018 at 10:48:48AM -0800, Howard McLauchlan wrote:
> Btrfs has two mount options for SSD optimizations: ssd and ssd_spread.
> Presently there is an option to disable all SSD optimizations, but there
> isn't an option to disable just ssd_spread.
> 
> This patch adds a mount option nossd_spread that disables ssd_spread
> only.
> 
> Reviewed-by: Josef Bacik 
> Signed-off-by: Howard McLauchlan 
> ---
> V2:remove duplicate code with fallthrough
> Based on 4.16-rc4

Added to next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ongoing Btrfs stability issues

2018-03-09 Thread Nikolay Borisov

> Sorry, I clearly missed that one. I have applied the patch you referenced and 
> rebooted the VM in question. This morning we had another FS failure on the 
> same machine that caused it to go into readonly mode. This happened after 
> that device was experiencing 100% I/O utilization for some time. No balance 
> was running at the time; last balance finished about 6 hours prior to the 
> error.
> 
> Kernel messages:
> [211238.262683] use_block_rsv: 163 callbacks suppressed
> [211238.262683] BTRFS: block rsv returned -28
> [211238.266718] [ cut here ]
> [211238.270462] WARNING: CPU: 0 PID: 391 at fs/btrfs/extent-tree.c:8463 
> btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
> [211238.277203] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE 
> nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo 
> iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype 
> iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic 
> br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul 
> ghash_clmulni_intel ppdev parport_pc intel_rapl_perf parport serio_raw evdev 
> ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash 
> raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm 
> drm_kms_helper aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse 
> drm ena scsi_mod i2c_piix4 button
> [211238.319618] CPU: 0 PID: 391 Comm: btrfs-transacti Tainted: GW 
>   4.14.13 #3
> [211238.325479] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
> [211238.330742] task: 9cb43abb70c0 task.stack: b234c3b58000
> [211238.335575] RIP: 0010:btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
> [211238.340454] RSP: 0018:b234c3b5b958 EFLAGS: 00010282
> [211238.344782] RAX: 001d RBX: 9cb43bdea128 RCX: 
> 
> [211238.350562] RDX:  RSI: 9cb440a166f8 RDI: 
> 9cb440a166f8
> [211238.356066] RBP: 4000 R08: 0001 R09: 
> 7d81
> [211238.361649] R10: 0001 R11: 7d81 R12: 
> 9cb43bdea000
> [211238.367304] R13: 9cb437f2c800 R14: 0001 R15: 
> ffe4
> [211238.372658] FS:  () GS:9cb440a0() 
> knlGS:
> [211238.379048] CS:  0010 DS:  ES:  CR0: 80050033
> [211238.384681] CR2: 7f90a6677000 CR3: 0003cea0a006 CR4: 
> 001606f0
> [211238.391380] DR0:  DR1:  DR2: 
> 
> [211238.398050] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [211238.404730] Call Trace:
> [211238.407880]  __btrfs_cow_block+0x125/0x5c0 [btrfs]
> [211238.412455]  btrfs_cow_block+0xcb/0x1b0 [btrfs]
> [211238.416292]  btrfs_search_slot+0x1fd/0x9e0 [btrfs]
> [211238.420630]  lookup_inline_extent_backref+0x105/0x610 [btrfs]
> [211238.425215]  __btrfs_free_extent.isra.61+0xf5/0xd30 [btrfs]
> [211238.429663]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
> [211238.434077]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
> [211238.438541]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
> [211238.442899]  ? remove_wait_queue+0x60/0x60
> [211238.446503]  transaction_kthread+0x195/0x1b0 [btrfs]
> [211238.450578]  kthread+0xfc/0x130
> [211238.453924]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
> [211238.458381]  ? kthread_create_on_node+0x70/0x70
> [211238.462225]  ? do_group_exit+0x3a/0xa0
> [211238.465586]  ret_from_fork+0x1f/0x30
> [211238.468814] Code: ff 48 c7 c6 28 97 58 c0 48 c7 c7 a0 e1 5d c0 e8 0c d0 
> f7 d5 85 c0 0f 84 1c fd ff ff 44 89 fe 48 c7 c7 58 0c 59 c0 e8 70 2f 9e d5 
> <0f> ff e9 06 fd ff ff 4c 63 e8 31 d2 48 89 ee 48 89 df e8 4e eb
> [211238.482366] ---[ end trace 48dd1ab4e2e46f6e ]---
> [211238.486524] BTRFS info (device xvdc): space_info 4 has 
> 18446744073258958848 free, is not full
> [211238.493014] BTRFS info (device xvdc): space_info total=10737418240, 
> used=7828127744, pinned=2128166912, reserved=243367936, may_use=988282880, 
> readonly=65536

Ok so the numbers here are helpful, they show that we have enough space
to allocate a chunk. I've also looked at the logic in 4.14.13 and all
the necessary patches are there. Unfortunately none of this matters due
to the fact that reserve_metadata_bytes is being called with
BTRFS_RESERVE_NO_FLUSH from use_block_rsv, meaning the code won't make
any effort to flush anything at all.

Can you tell again what the workload is - is it some sort of a database,
constantly writing to its files? If so, btrfs is not really suited for
rewrite-heavy workloads since this causes excessive CoW. In such cases
you really ought to set nodatacow on the specified files. For more
information:

https://btrfs.wiki.kernel.org/index.php/FAQ#Can_copy-on-write_be_turned_off_for_data_blocks.3F

The other thing that comes to mind is to try and tune the default commit
interval. Currently this is 30 seconds, meaning a 

Re: [PATCH 1/2] Btrfs: fiemap: pass correct bytenr when fm_extent_count is zero

2018-03-09 Thread Nikolay Borisov


On  9.03.2018 11:01, robbieko wrote:
> Nikolay Borisov 於 2018-03-07 19:15 寫到:
>> On  7.03.2018 12:27, robbieko wrote:
>>> Nikolay Borisov 於 2018-03-07 18:19 寫到:
 On  7.03.2018 10:20, robbieko wrote:
> From: Robbie Ko 
>
>  # mount /dev/vdb5 /mnt/btrfs
>  # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
>  # xfs_io -c "fiemap -v" /mnt/btrfs/file
>  /mnt/btrfs/file:
>  EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
>    0: [0..127]:    25088..25215   128   0x1
>
> Run fiemap with fm_extent_count set to 0, we'll get wrong value 4
> instead of 1.

 Wrong value 4 instead of 1 for which exact column, the flags? State
 this
 explicitly.

 Also this seems a bit bogus since fiemap's documentation states:

 If fm_extent_count is zero, then the fm_extents[] array is ignored (no
 extents will be returned), and the fm_mapped_extents count will hold
 the
 number of extents needed in fm_extents[] to hold the file's current
 mapping.

 So when fm_extent_count we shouldn't really be returning anything from
 kernel.

>>>
>>> Sorry I did not explain clearly.
>>> The value is fm_mapped_extents.
>>
>> But fm_mapped_extents is tagged as an OUT member, meaning the user has
>> no job writing to it.
>>
> 
> 
> [BUG]
> fm_mapped_extents is not correct when fm_extent_count is 0
> Like:
>   # mount /dev/vdb5 /mnt/btrfs
>   # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
>   # xfs_io -c "fiemap -v" /mnt/btrfs/file
>   /mnt/btrfs/file:
>   EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
>     0: [0..127]:    25088..25215   128   0x1
> 
> When user space wants to get the number of file extents,
> set fm_extent_count to 0 to run fiemap and then read fm_mapped_extents.
> 
> In the above example, fiemap will return with fm_mapped_extents set to 4 ,
> but it should be 1 since there's only one entry in the output.

This is slightly better.

> 
> [REASON]
> When fm_extent_count is 0, disko is not initialized correctly,
> The value is 0 in this case, not the right bytenr.

The value of what, be more explicit.

"
Also the problem seems to be that disko is only set if
fieinfo->fi_extents_max is set. And this member is initialized, in the
generic ioctl_fiemap function, to the value of used-passed
fm_extent_count. So when the user passes 0 then fi_extent_max is also
set to zero and this causes btrfs to not initialize disko at all.
Eventually this leads emit_fiemap_extent being called with a bogus
'phys' argument preventing proper fiemap entries merging.
"


You see how complicated in fact the issue is, and your 2 line
explanation is not really making it clear.

> It will cause the fiemap merge mechanism to fail.


> 
> [FIX]
> Use correct disko.

This is as uninformative as it gets... What about something like :

"Move the disko initialization earlier in extent_fiemap making it
independent of user-passed arguments, allowing emit_fiemap_extent to
properly handle consecutive extent entries."

> 
> Thanks.
> Robbie Ko
> 
>>> If fm_extent_count  is zero, the fm_mapped_extents count will hold the
>>> number of extents needed.
>>>
>>>

>
> [REASON]
> When fm_extent_count is 0, disko is not initialized correctly,
> The value is 0 in this case, not the right bytenr.

 This is too sparse, be more explicit i.e. that disko=0 is passed to
 emit_fiemap_extent which then leads to issues.

>
> [FIX]
> Use correct disko.
>
> Signed-off-by: Robbie Ko 
> ---
>  fs/btrfs/extent_io.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 012d638..066b6df 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4567,7 +4567,7 @@ int extent_fiemap(struct inode *inode, struct
> fiemap_extent_info *fieinfo,
>  offset_in_extent = em_start - em->start;
>  em_end = extent_map_end(em);
>  em_len = em_end - em_start;
> -    disko = 0;
> +    disko = em->block_start + offset_in_extent;
>  flags = 0;
>
>  /*
> @@ -4590,8 +4590,6 @@ int extent_fiemap(struct inode *inode, struct
> fiemap_extent_info *fieinfo,
>  u64 bytenr = em->block_start -
>  (em->start - em->orig_start);
>
> -    disko = em->block_start + offset_in_extent;
> -
>  /*
>   * As btrfs supports shared space, this information
>   * can be exported to userspace tools via
> -- 
> 1.9.1
>
> -- 
> To unsubscribe from this list: send the line "unsubscribe
> linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>>>
>>> 

Re: [PATCH 1/2] Btrfs: fiemap: pass correct bytenr when fm_extent_count is zero

2018-03-09 Thread robbieko

Nikolay Borisov 於 2018-03-07 19:15 寫到:

On  7.03.2018 12:27, robbieko wrote:

Nikolay Borisov 於 2018-03-07 18:19 寫到:

On  7.03.2018 10:20, robbieko wrote:

From: Robbie Ko 

 # mount /dev/vdb5 /mnt/btrfs
 # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/btrfs/file:
 EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
   0: [0..127]:    25088..25215   128   0x1

Run fiemap with fm_extent_count set to 0, we'll get wrong value 4
instead of 1.


Wrong value 4 instead of 1 for which exact column, the flags? State 
this

explicitly.

Also this seems a bit bogus since fiemap's documentation states:

If fm_extent_count is zero, then the fm_extents[] array is ignored 
(no
extents will be returned), and the fm_mapped_extents count will hold 
the

number of extents needed in fm_extents[] to hold the file's current
mapping.

So when fm_extent_count we shouldn't really be returning anything 
from

kernel.



Sorry I did not explain clearly.
The value is fm_mapped_extents.


But fm_mapped_extents is tagged as an OUT member, meaning the user has
no job writing to it.




[BUG]
fm_mapped_extents is not correct when fm_extent_count is 0
Like:
  # mount /dev/vdb5 /mnt/btrfs
  # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
  # xfs_io -c "fiemap -v" /mnt/btrfs/file
  /mnt/btrfs/file:
  EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
0: [0..127]:25088..25215   128   0x1

When user space wants to get the number of file extents,
set fm_extent_count to 0 to run fiemap and then read fm_mapped_extents.

In the above example, fiemap will return with fm_mapped_extents set to 4 
,

but it should be 1 since there's only one entry in the output.

[REASON]
When fm_extent_count is 0, disko is not initialized correctly,
The value is 0 in this case, not the right bytenr.
It will cause the fiemap merge mechanism to fail.

[FIX]
Use correct disko.

Thanks.
Robbie Ko


If fm_extent_count  is zero, the fm_mapped_extents count will hold the
number of extents needed.






[REASON]
When fm_extent_count is 0, disko is not initialized correctly,
The value is 0 in this case, not the right bytenr.


This is too sparse, be more explicit i.e. that disko=0 is passed to
emit_fiemap_extent which then leads to issues.



[FIX]
Use correct disko.

Signed-off-by: Robbie Ko 
---
 fs/btrfs/extent_io.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 012d638..066b6df 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4567,7 +4567,7 @@ int extent_fiemap(struct inode *inode, struct
fiemap_extent_info *fieinfo,
 offset_in_extent = em_start - em->start;
 em_end = extent_map_end(em);
 em_len = em_end - em_start;
-    disko = 0;
+    disko = em->block_start + offset_in_extent;
 flags = 0;

 /*
@@ -4590,8 +4590,6 @@ int extent_fiemap(struct inode *inode, struct
fiemap_extent_info *fieinfo,
 u64 bytenr = em->block_start -
 (em->start - em->orig_start);

-    disko = em->block_start + offset_in_extent;
-
 /*
  * As btrfs supports shared space, this information
  * can be exported to userspace tools via
--
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe
linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" 
in

the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs-progs: free-space-cache: Enhance free space cache free space check

2018-03-09 Thread Nikolay Borisov


On  9.03.2018 01:27, Qu Wenruo wrote:
> 
> 
> On 2018年03月08日 22:05, Nikolay Borisov wrote:
>>
>>
>> On  8.03.2018 09:02, Qu Wenruo wrote:
>>> When we found free space difference between free space cache and block
>>> group item, we just discard this free space cache.
>>>
>>> Normally such difference is caused by btrfs_reserve_extent() called by
>>> delalloc which is out of a transaction.
>>> And since all btrfs_release_extent() is called with a transaction, under
>>> heavy race free space cache can have less free space than block group
>>> item.
>>>
>>> Normally kernel will detect such difference and just discard that cache.
>>>
>>> However we must be more careful if free space cache has more free space
>>> cache, and if that happens, paried with above race one invalid free
>>> space cache can be loaded into kernel.
>>>
>>> So if we find any free space cache who has more free space then block
>>> group item, we report it as an error other than ignoring it.
>>>
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>> v2:
>>>   Fix the timming of free space output.
>>> ---
>>>  check/main.c   |  4 +++-
>>>  free-space-cache.c | 32 
>>>  2 files changed, 27 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/check/main.c b/check/main.c
>>> index 97baae583f04..bc31f7e32061 100644
>>> --- a/check/main.c
>>> +++ b/check/main.c
>>> @@ -5339,7 +5339,9 @@ static int check_space_cache(struct btrfs_root *root)
>>> error += ret;
>>> } else {
>>> ret = load_free_space_cache(root->fs_info, cache);
>>> -   if (!ret)
>>> +   if (ret < 0)
>>> +   error++;
>>> +   if (ret <= 0)
>>> continue;
>>> }
>>>  
>>> diff --git a/free-space-cache.c b/free-space-cache.c
>>> index f933f9f1cf3f..9b83a71ca59a 100644
>>> --- a/free-space-cache.c
>>> +++ b/free-space-cache.c
>>> @@ -438,7 +438,8 @@ int load_free_space_cache(struct btrfs_fs_info *fs_info,
>>> struct btrfs_path *path;
>>> u64 used = btrfs_block_group_used(_group->item);
>>> int ret = 0;
>>> -   int matched;
>>> +   u64 bg_free;
>>> +   s64 diff;
>>>  
>>> path = btrfs_alloc_path();
>>> if (!path)
>>> @@ -448,18 +449,33 @@ int load_free_space_cache(struct btrfs_fs_info 
>>> *fs_info,
>>>   block_group->key.objectid);
>>> btrfs_free_path(path);
>>>  
>>> -   matched = (ctl->free_space == (block_group->key.offset - used -
>>> -  block_group->bytes_super));
>>> -   if (ret == 1 && !matched) {
>>> -   __btrfs_remove_free_space_cache(ctl);
>>> +   bg_free = block_group->key.offset - used - block_group->bytes_super;
>>> +   diff = ctl->free_space - bg_free;
>>> +   if (ret == 1 && diff) {
>>> fprintf(stderr,
>>> -  "block group %llu has wrong amount of free space\n",
>>> -  block_group->key.objectid);
>>> +  "block group %llu has wrong amount of free space, free 
>>> space cache has %llu block group has %llu\n",nit: Always put units when 
>>> printing numbers. In this case we are talking
>> about bytes.
>>> +  block_group->key.objectid, ctl->free_space, bg_free);
>>> +   __btrfs_remove_free_space_cache(ctl);
>>> +   /*
>>> +* Due to btrfs_reserve_extent() can happen out of > +  
>>>  * transaction, but all btrfs_release_extent() happens inside
>>> +* a transaction, so under heavy race it's possible that free
>>> +* space cache has less free space, and both kernel just discard
>>> +* such cache. But if we find some case where free space cache
>>> +* has more free space, this means under certain case such
>>> +* cache can be loaded and cause double allocate.
>>> +*
>>> +* Detect such possibility here.
>>> +*/
>>> +   if (diff > 0)
>>> +   error(
>>> +"free space cache has more free space than block group item, this could 
>>> leads to serious corruption, please contact btrfs developers");
>>
>> I'm not entirely happy with this message. So they will post to the
>> mailing list saying something along the lines of "I got this message
>> what do I do no, please help".  Better to output actionable data so that
>> the user can post it immediately.
> 
> Unfortunately, this is already the situation we don't expect to see.
> 
> What we really need is to know this could happen, and if possible some
> info about the situation.
> There is not much actionable data here.

Fair enough, at the very least I think we should put information how to
contact btrfs developers. So put the address of the mailing list, I
don't think it's safe to assume people will be aware of it .

> 
> Thanks,
> Qu
> 
>>
>>> ret = -1;
>>> }
>>>  
>>> if (ret < 0) {
>>> -   ret