Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]
On Tue, Nov 14, 2017 at 3:53 PM, Rasmus Villemoeswrote: > > Odd. 7.2 and 7.1 (both of which I've just compiled from source, no > special configure flags or anything) generate exactly the same (good) > code for fs/namespace.o after patching. I also tried with > CONFIG_CC_OPTIMIZE_FOR_SIZE=y just in case, but that still produces > reasonable code. Oh well. It might be configuration-dependent. I usually end up having a couple of different configurations depending on what I'm doing, one being a fairly minimal one, and one being the "allmodconfig" I use mainly for build testing. Linus
Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]
On Tue, Nov 14, 2017 at 3:53 PM, Rasmus Villemoes wrote: > > Odd. 7.2 and 7.1 (both of which I've just compiled from source, no > special configure flags or anything) generate exactly the same (good) > code for fs/namespace.o after patching. I also tried with > CONFIG_CC_OPTIMIZE_FOR_SIZE=y just in case, but that still produces > reasonable code. Oh well. It might be configuration-dependent. I usually end up having a couple of different configurations depending on what I'm doing, one being a fairly minimal one, and one being the "allmodconfig" I use mainly for build testing. Linus
Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]
On 14 November 2017 at 23:43, Linus Torvaldswrote: > On Tue, Nov 14, 2017 at 2:24 PM, Rasmus Villemoes > wrote: >> >> Can you be more specific? That's not what I see with gcc 7.1. > > I have gcc-7.2.1, and it made a horrible mess of the do_mount() code. Odd. 7.2 and 7.1 (both of which I've just compiled from source, no special configure flags or anything) generate exactly the same (good) code for fs/namespace.o after patching. I also tried with CONFIG_CC_OPTIMIZE_FOR_SIZE=y just in case, but that still produces reasonable code. Oh well. Rasmus
Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]
On 14 November 2017 at 23:43, Linus Torvalds wrote: > On Tue, Nov 14, 2017 at 2:24 PM, Rasmus Villemoes > wrote: >> >> Can you be more specific? That's not what I see with gcc 7.1. > > I have gcc-7.2.1, and it made a horrible mess of the do_mount() code. Odd. 7.2 and 7.1 (both of which I've just compiled from source, no special configure flags or anything) generate exactly the same (good) code for fs/namespace.o after patching. I also tried with CONFIG_CC_OPTIMIZE_FOR_SIZE=y just in case, but that still produces reasonable code. Oh well. Rasmus
Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]
On Tue, Nov 14, 2017 at 2:24 PM, Rasmus Villemoeswrote: > > Can you be more specific? That's not what I see with gcc 7.1. I have gcc-7.2.1, and it made a horrible mess of the do_mount() code. Look for the comment "/* Separate the per-mountpoint flags */" and do the obvious conversion of the simple single-bit stuff. My gcc actually turned those into jumps after _after_ I converted it into the ternary operation, and then the ternary conversion actually did much worse, because it actually had two sides (one with a zero value, and one with the bit value to be set). I didn't have time to look into _why_ that code generated branch-overs, when the otherwise similar reverse case in fs/statfs.c did not. Linus
Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]
On Tue, Nov 14, 2017 at 2:24 PM, Rasmus Villemoes wrote: > > Can you be more specific? That's not what I see with gcc 7.1. I have gcc-7.2.1, and it made a horrible mess of the do_mount() code. Look for the comment "/* Separate the per-mountpoint flags */" and do the obvious conversion of the simple single-bit stuff. My gcc actually turned those into jumps after _after_ I converted it into the ternary operation, and then the ternary conversion actually did much worse, because it actually had two sides (one with a zero value, and one with the bit value to be set). I didn't have time to look into _why_ that code generated branch-overs, when the otherwise similar reverse case in fs/statfs.c did not. Linus
Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]
On 14 November 2017 at 00:54, Linus Torvaldswrote: > On Mon, Nov 13, 2017 at 3:30 PM, Linus Torvalds > wrote: >> >> So let's just rewrite that mnt_flags conversion that way, justr to get >> gcc to generate the obvious code. > > Oh wow. I tried to do the same thing in fs/namespace.c where it does > the reverse bit translation, and gcc makes a _horrible_ mess of it and > actually makes the code much worse because for some reason the pattern > doesn't trigger. [trimming cc list] Can you be more specific? That's not what I see with gcc 7.1. I've found two blocks where I replaced the if's with the ternary expressions, one in clone_mnt, one in do_mount. In clone_mnt, gcc-7.1 seems to do (first instruction is the unconditional |= MNT_LOCK_ATIME) 10d7: 0d 00 00 04 00 or $0x4,%eax 10dc: 89 43 30mov%eax,0x30(%rbx) 10df: a8 02 test $0x2,%al 10e1: 74 08 je 10eb 10e3: 0d 00 00 20 00 or $0x20,%eax 10e8: 89 43 30mov%eax,0x30(%rbx) 10eb: a8 01 test $0x1,%al 10ed: 74 08 je 10f7 10ef: 0d 00 00 10 00 or $0x10,%eax 10f4: 89 43 30mov%eax,0x30(%rbx) and after patching 10cc: 0d 00 00 04 00 or $0x4,%eax 10d1: 89 c2 mov%eax,%edx 10d3: c1 e2 10shl$0x10,%edx 10d6: 81 e2 00 00 40 00 and$0x40,%edx 10dc: 09 d0 or %edx,%eax 10de: 89 c2 mov%eax,%edx 10e0: c1 e2 14shl$0x14,%edx 10e3: 81 e2 00 00 20 00 and$0x20,%edx 10e9: 09 c2 or %eax,%edx (with a final store of %eax to 0x30(%rbx)). Either way it's four instructions per flag, but I assume the one without the branches is preferable. Similarly, in do_mount, before we have 3429: 44 89 f8mov%r15d,%eax 342c: 83 c8 01or $0x1,%eax 342f: 40 f6 c5 02 test $0x2,%bpl 3433: 44 0f 45 f8 cmovne %eax,%r15d 3437: 44 89 f8mov%r15d,%eax 343a: 83 c8 02or $0x2,%eax 343d: 40 f6 c5 04 test $0x4,%bpl 3441: 44 0f 45 f8 cmovne %eax,%r15d but after patching it does something like 3425: 4d 89 femov%r15,%r14 3428: 48 c1 ea 07 shr$0x7,%rdx 342c: 49 d1 eeshr%r14 342f: 89 d0 mov%edx,%eax 3431: c1 e1 05shl$0x5,%ecx 3434: 83 e0 08and$0x8,%eax 3437: 41 83 e6 07 and$0x7,%r14d 343b: 41 09 c6or %eax,%r14d 343e: 89 d0 mov%edx,%eax which actually makes use of MS_{NOSUID,NODEV,NOEXEC} all being 1 bit off from their MNT_ counterparts (witness the shr %r14 and the and with 0x7), so there we also cut 37 bytes according to bloat-o-meter. Now, it does seem that older (and not that old in absolute terms) compilers may generate worse code after the transformation - the 'replace with shift-mask-or' pattern doesn't exist until 7.1. E.g. with 5.4 we have $ scripts/bloat-o-meter fs/namespace.o.{0,1}-5.4 add/remove: 0/0 grow/shrink: 2/0 up/down: 50/0 (50) function old new delta do_mount31533195 +42 clone_mnt768 776 +8 5.4 compiles the "if() ..." construction roughly as 7.1, i.e. with cmovs, but the ternary expressions are transformed into this abomination 336f: 48 89 damov%rbx,%rdx 3372: 83 e2 04and$0x4,%edx 3375: 48 83 fa 01 cmp$0x1,%rdx 3379: 19 d2 sbb%edx,%edx 337b: f7 d2 not%edx 337d: 83 e2 02and$0x2,%edx 3380: 09 d0 or %edx,%eax 3382: 48 89 damov%rbx,%rdx 3385: 83 e2 08and$0x8,%edx 3388: 48 83 fa 01 cmp$0x1,%rdx 338c: 19 d2 sbb%edx,%edx 338e: f7 d2 not%edx 3390: 83 e2 04and$0x4,%edx 3393: 09 d0 or %edx,%eax Was it something like this you saw? Rasmus
Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]
On 14 November 2017 at 00:54, Linus Torvalds wrote: > On Mon, Nov 13, 2017 at 3:30 PM, Linus Torvalds > wrote: >> >> So let's just rewrite that mnt_flags conversion that way, justr to get >> gcc to generate the obvious code. > > Oh wow. I tried to do the same thing in fs/namespace.c where it does > the reverse bit translation, and gcc makes a _horrible_ mess of it and > actually makes the code much worse because for some reason the pattern > doesn't trigger. [trimming cc list] Can you be more specific? That's not what I see with gcc 7.1. I've found two blocks where I replaced the if's with the ternary expressions, one in clone_mnt, one in do_mount. In clone_mnt, gcc-7.1 seems to do (first instruction is the unconditional |= MNT_LOCK_ATIME) 10d7: 0d 00 00 04 00 or $0x4,%eax 10dc: 89 43 30mov%eax,0x30(%rbx) 10df: a8 02 test $0x2,%al 10e1: 74 08 je 10eb 10e3: 0d 00 00 20 00 or $0x20,%eax 10e8: 89 43 30mov%eax,0x30(%rbx) 10eb: a8 01 test $0x1,%al 10ed: 74 08 je 10f7 10ef: 0d 00 00 10 00 or $0x10,%eax 10f4: 89 43 30mov%eax,0x30(%rbx) and after patching 10cc: 0d 00 00 04 00 or $0x4,%eax 10d1: 89 c2 mov%eax,%edx 10d3: c1 e2 10shl$0x10,%edx 10d6: 81 e2 00 00 40 00 and$0x40,%edx 10dc: 09 d0 or %edx,%eax 10de: 89 c2 mov%eax,%edx 10e0: c1 e2 14shl$0x14,%edx 10e3: 81 e2 00 00 20 00 and$0x20,%edx 10e9: 09 c2 or %eax,%edx (with a final store of %eax to 0x30(%rbx)). Either way it's four instructions per flag, but I assume the one without the branches is preferable. Similarly, in do_mount, before we have 3429: 44 89 f8mov%r15d,%eax 342c: 83 c8 01or $0x1,%eax 342f: 40 f6 c5 02 test $0x2,%bpl 3433: 44 0f 45 f8 cmovne %eax,%r15d 3437: 44 89 f8mov%r15d,%eax 343a: 83 c8 02or $0x2,%eax 343d: 40 f6 c5 04 test $0x4,%bpl 3441: 44 0f 45 f8 cmovne %eax,%r15d but after patching it does something like 3425: 4d 89 femov%r15,%r14 3428: 48 c1 ea 07 shr$0x7,%rdx 342c: 49 d1 eeshr%r14 342f: 89 d0 mov%edx,%eax 3431: c1 e1 05shl$0x5,%ecx 3434: 83 e0 08and$0x8,%eax 3437: 41 83 e6 07 and$0x7,%r14d 343b: 41 09 c6or %eax,%r14d 343e: 89 d0 mov%edx,%eax which actually makes use of MS_{NOSUID,NODEV,NOEXEC} all being 1 bit off from their MNT_ counterparts (witness the shr %r14 and the and with 0x7), so there we also cut 37 bytes according to bloat-o-meter. Now, it does seem that older (and not that old in absolute terms) compilers may generate worse code after the transformation - the 'replace with shift-mask-or' pattern doesn't exist until 7.1. E.g. with 5.4 we have $ scripts/bloat-o-meter fs/namespace.o.{0,1}-5.4 add/remove: 0/0 grow/shrink: 2/0 up/down: 50/0 (50) function old new delta do_mount31533195 +42 clone_mnt768 776 +8 5.4 compiles the "if() ..." construction roughly as 7.1, i.e. with cmovs, but the ternary expressions are transformed into this abomination 336f: 48 89 damov%rbx,%rdx 3372: 83 e2 04and$0x4,%edx 3375: 48 83 fa 01 cmp$0x1,%rdx 3379: 19 d2 sbb%edx,%edx 337b: f7 d2 not%edx 337d: 83 e2 02and$0x2,%edx 3380: 09 d0 or %edx,%eax 3382: 48 89 damov%rbx,%rdx 3385: 83 e2 08and$0x8,%edx 3388: 48 83 fa 01 cmp$0x1,%rdx 338c: 19 d2 sbb%edx,%edx 338e: f7 d2 not%edx 3390: 83 e2 04and$0x4,%edx 3393: 09 d0 or %edx,%eax Was it something like this you saw? Rasmus
Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]
On Mon, Nov 13, 2017 at 3:30 PM, Linus Torvaldswrote: > > So let's just rewrite that mnt_flags conversion that way, justr to get > gcc to generate the obvious code. Oh wow. I tried to do the same thing in fs/namespace.c where it does the reverse bit translation, and gcc makes a _horrible_ mess of it and actually makes the code much worse because for some reason the pattern doesn't trigger. So this gcc optimization is apparently pretty damn fragile in general. It triggers for the trivial cases, but then other code around it can confuse it badly. So I don't think I'll touch this, it seems to not be really reliably something that makes gcc generate what should be the obvious code.. Linus
Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]
On Mon, Nov 13, 2017 at 3:30 PM, Linus Torvalds wrote: > > So let's just rewrite that mnt_flags conversion that way, justr to get > gcc to generate the obvious code. Oh wow. I tried to do the same thing in fs/namespace.c where it does the reverse bit translation, and gcc makes a _horrible_ mess of it and actually makes the code much worse because for some reason the pattern doesn't trigger. So this gcc optimization is apparently pretty damn fragile in general. It triggers for the trivial cases, but then other code around it can confuse it badly. So I don't think I'll touch this, it seems to not be really reliably something that makes gcc generate what should be the obvious code.. Linus
Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]
On Mon, Nov 13, 2017 at 2:59 PM, Rasmus Villemoeswrote: >> Sadly, gcc makes a mess of it and actually generates code that looks >> like the original C.[...] > > Actually, new enough gcc (7.1, I think) does contain a pattern that does > this, but unfortunately only if one spells it > > y |= (x & BIT) ? OTHER_BIT : 0; Ahh, I should have recognized that, I think that's what we ended up doing with the VM_READ -> PROT_READ translation in a few places, exactly because gcc would then recognize it and do the much better code generation. > which is half-way to doing it by hand, I suppose. Yeah, but it is at least acceptable, and the code is still legible C. The alternatives of doing it _entirely_ by hand tend to be much worse (ie you end up using a macro from hell that checks which of the two bits are bigger and shifting in the right direction by using multiplication or division). So let's just rewrite that mnt_flags conversion that way, justr to get gcc to generate the obvious code. It's a bit sad how gcc didn't pick up on the original code, especially since it had already done the much more complicated translation of doing the if-conversion. Thanks for pointing out the gcc pattern. Linus
Re: bit tweaks [was: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11]
On Mon, Nov 13, 2017 at 2:59 PM, Rasmus Villemoes wrote: >> Sadly, gcc makes a mess of it and actually generates code that looks >> like the original C.[...] > > Actually, new enough gcc (7.1, I think) does contain a pattern that does > this, but unfortunately only if one spells it > > y |= (x & BIT) ? OTHER_BIT : 0; Ahh, I should have recognized that, I think that's what we ended up doing with the VM_READ -> PROT_READ translation in a few places, exactly because gcc would then recognize it and do the much better code generation. > which is half-way to doing it by hand, I suppose. Yeah, but it is at least acceptable, and the code is still legible C. The alternatives of doing it _entirely_ by hand tend to be much worse (ie you end up using a macro from hell that checks which of the two bits are bigger and shifting in the right direction by using multiplication or division). So let's just rewrite that mnt_flags conversion that way, justr to get gcc to generate the obvious code. It's a bit sad how gcc didn't pick up on the original code, especially since it had already done the much more complicated translation of doing the if-conversion. Thanks for pointing out the gcc pattern. Linus