Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]

2017-11-14 Thread Linus Torvalds
On Tue, Nov 14, 2017 at 3:53 PM, Rasmus Villemoes
 wrote:
>
> Odd. 7.2 and 7.1 (both of which I've just compiled from source, no
> special configure flags or anything) generate exactly the same (good)
> code for fs/namespace.o after patching. I also tried with
> CONFIG_CC_OPTIMIZE_FOR_SIZE=y just in case, but that still produces
> reasonable code. Oh well.

It might be configuration-dependent. I usually end up having a couple
of different configurations depending on what I'm doing, one being a
fairly minimal one, and one being the "allmodconfig" I use mainly for
build testing.

   Linus


Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]

2017-11-14 Thread Linus Torvalds
On Tue, Nov 14, 2017 at 3:53 PM, Rasmus Villemoes
 wrote:
>
> Odd. 7.2 and 7.1 (both of which I've just compiled from source, no
> special configure flags or anything) generate exactly the same (good)
> code for fs/namespace.o after patching. I also tried with
> CONFIG_CC_OPTIMIZE_FOR_SIZE=y just in case, but that still produces
> reasonable code. Oh well.

It might be configuration-dependent. I usually end up having a couple
of different configurations depending on what I'm doing, one being a
fairly minimal one, and one being the "allmodconfig" I use mainly for
build testing.

   Linus


Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]

2017-11-14 Thread Rasmus Villemoes
On 14 November 2017 at 23:43, Linus Torvalds
 wrote:
> On Tue, Nov 14, 2017 at 2:24 PM, Rasmus Villemoes
>  wrote:
>>
>> Can you be more specific? That's not what I see with gcc 7.1.
>
> I have gcc-7.2.1, and it made a horrible mess of the do_mount() code.

Odd. 7.2 and 7.1 (both of which I've just compiled from source, no
special configure flags or anything) generate exactly the same (good)
code for fs/namespace.o after patching. I also tried with
CONFIG_CC_OPTIMIZE_FOR_SIZE=y just in case, but that still produces
reasonable code. Oh well.

Rasmus


Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]

2017-11-14 Thread Rasmus Villemoes
On 14 November 2017 at 23:43, Linus Torvalds
 wrote:
> On Tue, Nov 14, 2017 at 2:24 PM, Rasmus Villemoes
>  wrote:
>>
>> Can you be more specific? That's not what I see with gcc 7.1.
>
> I have gcc-7.2.1, and it made a horrible mess of the do_mount() code.

Odd. 7.2 and 7.1 (both of which I've just compiled from source, no
special configure flags or anything) generate exactly the same (good)
code for fs/namespace.o after patching. I also tried with
CONFIG_CC_OPTIMIZE_FOR_SIZE=y just in case, but that still produces
reasonable code. Oh well.

Rasmus


Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]

2017-11-14 Thread Linus Torvalds
On Tue, Nov 14, 2017 at 2:24 PM, Rasmus Villemoes
 wrote:
>
> Can you be more specific? That's not what I see with gcc 7.1.

I have gcc-7.2.1, and it made a horrible mess of the do_mount() code.

Look for the comment "/* Separate the per-mountpoint flags */" and do
the obvious conversion of the simple single-bit stuff.

My gcc actually turned those into jumps after _after_ I converted it
into the ternary operation, and then the ternary conversion actually
did much worse, because it actually had two sides (one with a zero
value, and one with the bit value to be set).

I didn't have time to look into _why_ that code generated
branch-overs, when the otherwise similar reverse case in fs/statfs.c
did not.

   Linus


Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]

2017-11-14 Thread Linus Torvalds
On Tue, Nov 14, 2017 at 2:24 PM, Rasmus Villemoes
 wrote:
>
> Can you be more specific? That's not what I see with gcc 7.1.

I have gcc-7.2.1, and it made a horrible mess of the do_mount() code.

Look for the comment "/* Separate the per-mountpoint flags */" and do
the obvious conversion of the simple single-bit stuff.

My gcc actually turned those into jumps after _after_ I converted it
into the ternary operation, and then the ternary conversion actually
did much worse, because it actually had two sides (one with a zero
value, and one with the bit value to be set).

I didn't have time to look into _why_ that code generated
branch-overs, when the otherwise similar reverse case in fs/statfs.c
did not.

   Linus


Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]

2017-11-14 Thread Rasmus Villemoes
On 14 November 2017 at 00:54, Linus Torvalds
 wrote:
> On Mon, Nov 13, 2017 at 3:30 PM, Linus Torvalds
>  wrote:
>>
>> So let's just rewrite that mnt_flags conversion that way, justr to get
>> gcc to generate the obvious code.
>
> Oh wow. I tried to do the same thing in fs/namespace.c where it does
> the reverse bit translation, and gcc makes a _horrible_ mess of it and
> actually makes the code much worse because for some reason the pattern
> doesn't trigger.

[trimming cc list]

Can you be more specific? That's not what I see with gcc 7.1. I've
found two blocks where I replaced the if's with the ternary
expressions, one in clone_mnt, one in do_mount. In clone_mnt, gcc-7.1
seems to do (first instruction is the unconditional |= MNT_LOCK_ATIME)

10d7:   0d 00 00 04 00  or $0x4,%eax
10dc:   89 43 30mov%eax,0x30(%rbx)
10df:   a8 02   test   $0x2,%al
10e1:   74 08   je 10eb 
10e3:   0d 00 00 20 00  or $0x20,%eax
10e8:   89 43 30mov%eax,0x30(%rbx)
10eb:   a8 01   test   $0x1,%al
10ed:   74 08   je 10f7 
10ef:   0d 00 00 10 00  or $0x10,%eax
10f4:   89 43 30mov%eax,0x30(%rbx)

and after patching

10cc:   0d 00 00 04 00  or $0x4,%eax
10d1:   89 c2   mov%eax,%edx
10d3:   c1 e2 10shl$0x10,%edx
10d6:   81 e2 00 00 40 00   and$0x40,%edx
10dc:   09 d0   or %edx,%eax
10de:   89 c2   mov%eax,%edx
10e0:   c1 e2 14shl$0x14,%edx
10e3:   81 e2 00 00 20 00   and$0x20,%edx
10e9:   09 c2   or %eax,%edx

(with a final store of %eax to 0x30(%rbx)). Either way it's four
instructions per flag, but I assume the one without the branches is
preferable.

Similarly, in do_mount, before we have

3429:   44 89 f8mov%r15d,%eax
342c:   83 c8 01or $0x1,%eax
342f:   40 f6 c5 02 test   $0x2,%bpl
3433:   44 0f 45 f8 cmovne %eax,%r15d
3437:   44 89 f8mov%r15d,%eax
343a:   83 c8 02or $0x2,%eax
343d:   40 f6 c5 04 test   $0x4,%bpl
3441:   44 0f 45 f8 cmovne %eax,%r15d

but after patching it does something like

3425:   4d 89 femov%r15,%r14
3428:   48 c1 ea 07 shr$0x7,%rdx
342c:   49 d1 eeshr%r14
342f:   89 d0   mov%edx,%eax
3431:   c1 e1 05shl$0x5,%ecx
3434:   83 e0 08and$0x8,%eax
3437:   41 83 e6 07 and$0x7,%r14d
343b:   41 09 c6or %eax,%r14d
343e:   89 d0   mov%edx,%eax

which actually makes use of MS_{NOSUID,NODEV,NOEXEC} all being 1 bit
off from their MNT_ counterparts (witness the shr %r14 and the and
with 0x7), so there we also cut 37 bytes according to bloat-o-meter.

Now, it does seem that older (and not that old in absolute terms)
compilers may generate worse code after the transformation - the
'replace with shift-mask-or' pattern doesn't exist until 7.1. E.g.
with 5.4 we have

$ scripts/bloat-o-meter fs/namespace.o.{0,1}-5.4
add/remove: 0/0 grow/shrink: 2/0 up/down: 50/0 (50)
function old new   delta
do_mount31533195 +42
clone_mnt768 776  +8

5.4 compiles the "if() ..." construction roughly as 7.1, i.e. with
cmovs, but the ternary expressions are transformed into this
abomination

336f:   48 89 damov%rbx,%rdx
3372:   83 e2 04and$0x4,%edx
3375:   48 83 fa 01 cmp$0x1,%rdx
3379:   19 d2   sbb%edx,%edx
337b:   f7 d2   not%edx
337d:   83 e2 02and$0x2,%edx
3380:   09 d0   or %edx,%eax
3382:   48 89 damov%rbx,%rdx
3385:   83 e2 08and$0x8,%edx
3388:   48 83 fa 01 cmp$0x1,%rdx
338c:   19 d2   sbb%edx,%edx
338e:   f7 d2   not%edx
3390:   83 e2 04and$0x4,%edx
3393:   09 d0   or %edx,%eax

Was it something like this you saw?

Rasmus


Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]

2017-11-14 Thread Rasmus Villemoes
On 14 November 2017 at 00:54, Linus Torvalds
 wrote:
> On Mon, Nov 13, 2017 at 3:30 PM, Linus Torvalds
>  wrote:
>>
>> So let's just rewrite that mnt_flags conversion that way, justr to get
>> gcc to generate the obvious code.
>
> Oh wow. I tried to do the same thing in fs/namespace.c where it does
> the reverse bit translation, and gcc makes a _horrible_ mess of it and
> actually makes the code much worse because for some reason the pattern
> doesn't trigger.

[trimming cc list]

Can you be more specific? That's not what I see with gcc 7.1. I've
found two blocks where I replaced the if's with the ternary
expressions, one in clone_mnt, one in do_mount. In clone_mnt, gcc-7.1
seems to do (first instruction is the unconditional |= MNT_LOCK_ATIME)

10d7:   0d 00 00 04 00  or $0x4,%eax
10dc:   89 43 30mov%eax,0x30(%rbx)
10df:   a8 02   test   $0x2,%al
10e1:   74 08   je 10eb 
10e3:   0d 00 00 20 00  or $0x20,%eax
10e8:   89 43 30mov%eax,0x30(%rbx)
10eb:   a8 01   test   $0x1,%al
10ed:   74 08   je 10f7 
10ef:   0d 00 00 10 00  or $0x10,%eax
10f4:   89 43 30mov%eax,0x30(%rbx)

and after patching

10cc:   0d 00 00 04 00  or $0x4,%eax
10d1:   89 c2   mov%eax,%edx
10d3:   c1 e2 10shl$0x10,%edx
10d6:   81 e2 00 00 40 00   and$0x40,%edx
10dc:   09 d0   or %edx,%eax
10de:   89 c2   mov%eax,%edx
10e0:   c1 e2 14shl$0x14,%edx
10e3:   81 e2 00 00 20 00   and$0x20,%edx
10e9:   09 c2   or %eax,%edx

(with a final store of %eax to 0x30(%rbx)). Either way it's four
instructions per flag, but I assume the one without the branches is
preferable.

Similarly, in do_mount, before we have

3429:   44 89 f8mov%r15d,%eax
342c:   83 c8 01or $0x1,%eax
342f:   40 f6 c5 02 test   $0x2,%bpl
3433:   44 0f 45 f8 cmovne %eax,%r15d
3437:   44 89 f8mov%r15d,%eax
343a:   83 c8 02or $0x2,%eax
343d:   40 f6 c5 04 test   $0x4,%bpl
3441:   44 0f 45 f8 cmovne %eax,%r15d

but after patching it does something like

3425:   4d 89 femov%r15,%r14
3428:   48 c1 ea 07 shr$0x7,%rdx
342c:   49 d1 eeshr%r14
342f:   89 d0   mov%edx,%eax
3431:   c1 e1 05shl$0x5,%ecx
3434:   83 e0 08and$0x8,%eax
3437:   41 83 e6 07 and$0x7,%r14d
343b:   41 09 c6or %eax,%r14d
343e:   89 d0   mov%edx,%eax

which actually makes use of MS_{NOSUID,NODEV,NOEXEC} all being 1 bit
off from their MNT_ counterparts (witness the shr %r14 and the and
with 0x7), so there we also cut 37 bytes according to bloat-o-meter.

Now, it does seem that older (and not that old in absolute terms)
compilers may generate worse code after the transformation - the
'replace with shift-mask-or' pattern doesn't exist until 7.1. E.g.
with 5.4 we have

$ scripts/bloat-o-meter fs/namespace.o.{0,1}-5.4
add/remove: 0/0 grow/shrink: 2/0 up/down: 50/0 (50)
function old new   delta
do_mount31533195 +42
clone_mnt768 776  +8

5.4 compiles the "if() ..." construction roughly as 7.1, i.e. with
cmovs, but the ternary expressions are transformed into this
abomination

336f:   48 89 damov%rbx,%rdx
3372:   83 e2 04and$0x4,%edx
3375:   48 83 fa 01 cmp$0x1,%rdx
3379:   19 d2   sbb%edx,%edx
337b:   f7 d2   not%edx
337d:   83 e2 02and$0x2,%edx
3380:   09 d0   or %edx,%eax
3382:   48 89 damov%rbx,%rdx
3385:   83 e2 08and$0x8,%edx
3388:   48 83 fa 01 cmp$0x1,%rdx
338c:   19 d2   sbb%edx,%edx
338e:   f7 d2   not%edx
3390:   83 e2 04and$0x4,%edx
3393:   09 d0   or %edx,%eax

Was it something like this you saw?

Rasmus


Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]

2017-11-13 Thread Linus Torvalds
On Mon, Nov 13, 2017 at 3:30 PM, Linus Torvalds
 wrote:
>
> So let's just rewrite that mnt_flags conversion that way, justr to get
> gcc to generate the obvious code.

Oh wow. I tried to do the same thing in fs/namespace.c where it does
the reverse bit translation, and gcc makes a _horrible_ mess of it and
actually makes the code much worse because for some reason the pattern
doesn't trigger.

So this gcc optimization is apparently pretty damn fragile in general.
It triggers for the trivial cases, but then other code around it can
confuse it badly.

So I don't think I'll touch this, it seems to not be really reliably
something that makes gcc generate what should be the obvious code..

   Linus


Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]

2017-11-13 Thread Linus Torvalds
On Mon, Nov 13, 2017 at 3:30 PM, Linus Torvalds
 wrote:
>
> So let's just rewrite that mnt_flags conversion that way, justr to get
> gcc to generate the obvious code.

Oh wow. I tried to do the same thing in fs/namespace.c where it does
the reverse bit translation, and gcc makes a _horrible_ mess of it and
actually makes the code much worse because for some reason the pattern
doesn't trigger.

So this gcc optimization is apparently pretty damn fragile in general.
It triggers for the trivial cases, but then other code around it can
confuse it badly.

So I don't think I'll touch this, it seems to not be really reliably
something that makes gcc generate what should be the obvious code..

   Linus


Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]

2017-11-13 Thread Linus Torvalds
On Mon, Nov 13, 2017 at 2:59 PM, Rasmus Villemoes
 wrote:
>> Sadly, gcc makes a mess of it and actually generates code that looks
>> like the original C.[...]
>
> Actually, new enough gcc (7.1, I think) does contain a pattern that does
> this, but unfortunately only if one spells it
>
>   y |= (x & BIT) ? OTHER_BIT : 0;

Ahh, I should have recognized that, I think that's what we ended up
doing with the VM_READ -> PROT_READ translation in a few places,
exactly because gcc would then recognize it and do the much better
code generation.

> which is half-way to doing it by hand, I suppose.

Yeah, but it is at least acceptable, and the code is still legible C.
The alternatives of doing it _entirely_ by hand tend to be much worse
(ie you end up using a macro from hell that checks which of the two
bits are bigger and shifting in the right direction by using
multiplication or division).

So let's just rewrite that mnt_flags conversion that way, justr to get
gcc to generate the obvious code.

It's a bit sad how gcc didn't pick up on the original code, especially
since it had already done the much more complicated translation of
doing the if-conversion.

Thanks for pointing out the gcc pattern.

  Linus


Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]

2017-11-13 Thread Linus Torvalds
On Mon, Nov 13, 2017 at 2:59 PM, Rasmus Villemoes
 wrote:
>> Sadly, gcc makes a mess of it and actually generates code that looks
>> like the original C.[...]
>
> Actually, new enough gcc (7.1, I think) does contain a pattern that does
> this, but unfortunately only if one spells it
>
>   y |= (x & BIT) ? OTHER_BIT : 0;

Ahh, I should have recognized that, I think that's what we ended up
doing with the VM_READ -> PROT_READ translation in a few places,
exactly because gcc would then recognize it and do the much better
code generation.

> which is half-way to doing it by hand, I suppose.

Yeah, but it is at least acceptable, and the code is still legible C.
The alternatives of doing it _entirely_ by hand tend to be much worse
(ie you end up using a macro from hell that checks which of the two
bits are bigger and shifting in the right direction by using
multiplication or division).

So let's just rewrite that mnt_flags conversion that way, justr to get
gcc to generate the obvious code.

It's a bit sad how gcc didn't pick up on the original code, especially
since it had already done the much more complicated translation of
doing the if-conversion.

Thanks for pointing out the gcc pattern.

  Linus