Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?

2019-03-30 Thread Segher Boessenkool
On Sat, Mar 30, 2019 at 10:30:23AM +, George Spelvin wrote:
> For s390, that was added on 24 March 2017 by
> https://gcc.gnu.org/viewcvs/gcc?view=revision=246457
> which is part of GCC 7.
> 
> It also only applies to TARGET_ARCH12, which I am guessing
> corresponds to HAVE_MARCH_ZEC12_FEATURES.

zEC12 is arch10, while z14 is arch12.  See
https://sourceware.org/binutils/docs-2.32/as/s390-Options.html#s390-Options
for example; it lists the correspondences, and states "The processor names
starting with arch refer to the edition number in the Principle of
Operations manual.  They can be used as alternate processor names and have
been added for compatibility with the IBM XL compiler."

Newer GCC does not use the somewhat confusing TARGET_ARCH12 name anymore;
see https://gcc.gnu.org/r264796 .


Segher


Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?

2019-03-30 Thread Segher Boessenkool
On Sat, Mar 30, 2019 at 11:28:21AM +, George Spelvin wrote:
> >> I like that the MIPS code leaves the high half of the product in
> >> the hi register until it tests the low half; I wish PowerPC would
> >> similarly move the mulhdu *after* the loop,
> 
> > The MIPS code has the multiplication inside the loop as well, and even
> > the mfhi I think: MIPS has delay slots.
> 
> Yes, it's in the delay slot, which is fine (the branch is unlikely,
> after all).  But it does the compare (sltu) before accessing %hi, which
> is good as %hi often has a longer latency than %lo.  (On out-of-order
> cores, of course, none of this matters.)

Yes.  But it does the mfhi on every iteration, while it only needs it for
the last one (or after the last one).  This may not be more expensive for
the actual hardware, but it is for GCC's cost model

> > GCC treats the int128 as one register until it has expanded to RTL, and it
> > does not do such loop optimisations after that, apparently.
> > 
> > File a PR please?  https://gcc.gnu.org/bugzilla/
> 
> Er...  about what?  The fact that the PowerPC code is not
> >> PowerPC:
> >> .L9:
> >>bl get_random_u64
> >>nop
> >>mulld 9,3,31
> >>cmpld 7,30,9
> >>bgt 7,.L9
> >>mulhdu 3,3,31
> 
> I'm not sure quite how to explain it in gcc-ese.

Yeah, exactly, like that.  This transformation is called "loop sinking"
usually: if anything that is set within a loop is only used after the loop,
it can be set after the loop (provided you keep the set's sources alive).


Segher


Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?

2019-03-30 Thread Segher Boessenkool
On Sat, Mar 30, 2019 at 09:00:15AM +1300, Michael Cree wrote:
> It does move the umulh inside the loop but that seems sensible since
> the use of unlikely() implies that the loop is unlikely to be taken
> so on average it would be a good bet to start the calculation of
> umulh earlier since it has a few cycles latency to get the result,
> and it is pipelined so it can be calculated in the shadow of the
> mulq instruction on the same execution unit.

That may make sense, but it is not what happens, sorry.  It _starts off_
as part of the loop, and it is never moved outside.

The only difference between a likely loop and an unlikely loop here I've
seen (on all targets I tried) is that with a likely loop the loop target
is aligned, while with an unlikely loop it isn't.

> On the older CPUs
> (before EV6 which are not out-of-order execution) having the umulh
> inside the loop may be a net gain.

Yes.  Similarly, on Power you can often calculate the high mul at the same
time as the low mul, for no extra cost.  This may be true on many archs.


Segher


Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?

2019-03-30 Thread George Spelvin
I've been tracking down when "umulditi3" support was added to various
gcc platforms.  So far, I've found:

ia64: 2005-07-28, gcc 4.1.0
https://gcc.gnu.org/viewcvs/gcc?view=revision=102463

mips: 2008-06-09, gcc 4.4.0
https://gcc.gnu.org/viewcvs/gcc?view=revision=136600 

alpha: 2013-02-01, gcc 4.8.0
https://gcc.gnu.org/viewcvs/gcc?view=revision=195668

s390: 2011-10-07, gcc 4.7.0
https://gcc.gnu.org/viewcvs/gcc?view=revision=179647

ppc64: 2013-02-01, gcc 4.8.0
https://gcc.gnu.org/viewcvs/gcc?view=revision=195667


Here's a revised s390 patch.

>From e8ea8c0a5d618385049248649b8c13717b598a42 Mon Sep 17 00:00:00 2001
From: George Spelvin 
Date: Sat, 30 Mar 2019 10:27:14 +
Subject: [PATCH v2] s390: Enable CONFIG_ARCH_SUPPORTS_INT128 on 64-bit builds

If a platform supports a 64x64->128-bit widening multiply,
that allows more efficient scaling of 64-bit values in various
parts of the kernel.  GCC advertises __int128 support with the
__INT128__ #define, but we care about efficient inline
support, so this is a separate flag.

For s390, that was added on 2011-10-07 by
https://gcc.gnu.org/viewcvs/gcc?view=revision=179647
which is part of GCC 4.7.

Signed-off-by: George Spelvin 
---
 arch/s390/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index ed554b09eb3f..6ddaee6573f4 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -372,6 +372,7 @@ endchoice
 
 config 64BIT
def_bool y
+   select ARCH_SUPPORTS_INT128 if GCC_VERSION >= 40700 || CC_IS_CLANG
 
 config COMPAT
def_bool y
-- 
2.20.1



Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?

2019-03-30 Thread George Spelvin
General update:

I've since found the reason for the GCC version check.
It's not *broken* support (apologies for impugning GCC),
but *inefficient* support via a muditi3 helper function.

The version check is the version which added in-line code generation.
For example, s390 support was added in March of 2017 and is part
of the 7.1 release.

I hvave to do some digging through gcc version history to
find when it was added to various architectures.
(And MIPS64v6 support is still lacking in gcc 9.)


On Fri, 29 Mar 2019 at 15:25:58 -0500, Segher Boessenkool wrote:
> On Fri, Mar 29, 2019 at 01:07:07PM +, George Spelvin wrote:
>> I don't have easy access to an Alpha cross-compiler to test, but
>> as it has UMULH, I suspect it would work, too.
> 
> https://mirrors.edge.kernel.org/pub/tools/crosstool/

Thanks for the pointer; I have a working Alpha cross-compiler now.
 
>> u64 get_random_u64(void);
>> u64 get_random_max64(u64 range, u64 lim)
>> {
>>  unsigned __int128 prod;
>>  do {
>>  prod = (unsigned __int128)get_random_u64() * range;
>>  } while (unlikely((u64)prod < lim));
>>  return prod >> 64;
>> }
> 
>> MIPS:
>> .L7:
>>  jal get_random_u64
>>  nop
>>  dmultu $2,$17
>>  mflo$3
>>  sltu$4,$3,$16
>>  bne $4,$0,.L7
>>  mfhi$2
>> 
>> PowerPC:
>> .L9:
>>  bl get_random_u64
>>  nop
>>  mulld 9,3,31
>>  mulhdu 3,3,31
>>  cmpld 7,30,9
>>  bgt 7,.L9
>>
>> I like that the MIPS code leaves the high half of the product in
>> the hi register until it tests the low half; I wish PowerPC would
>> similarly move the mulhdu *after* the loop,

> The MIPS code has the multiplication inside the loop as well, and even
> the mfhi I think: MIPS has delay slots.

Yes, it's in the delay slot, which is fine (the branch is unlikely,
after all).  But it does the compare (sltu) before accessing %hi, which
is good as %hi often has a longer latency than %lo.  (On out-of-order
cores, of course, none of this matters.)

> GCC treats the int128 as one register until it has expanded to RTL, and it
> does not do such loop optimisations after that, apparently.
> 
> File a PR please?  https://gcc.gnu.org/bugzilla/

Er...  about what?  The fact that the PowerPC code is not
>> PowerPC:
>> .L9:
>>  bl get_random_u64
>>  nop
>>  mulld 9,3,31
>>  cmpld 7,30,9
>>  bgt 7,.L9
>>  mulhdu 3,3,31

I'm not sure quite how to explain it in gcc-ese.


Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?

2019-03-30 Thread George Spelvin
On Sat, 30 Mar 2019 at 09:43:47 +0100, Heiko Carstens wrote:
> It hasn't been enabled on s390 simply because at least I wasn't aware
> of this config option. Feel free to send a patch, otherwise I will
> enable this. Whatever you prefer.
> 
> Thanks for pointing this out!

Here's a draft patch, but obviously it should be tested!

>From 6f3cc608c49dba33a38e81232a2fd46e8fa8dbcd Mon Sep 17 00:00:00 2001
From: George Spelvin 
Date: Sat, 30 Mar 2019 10:27:14 +
Subject: [PATCH] s390: Enable CONFIG_ARCH_SUPPORTS_INT128 on 64-bit builds

If a platform supports a 64x64->128-bit widening multiply,
that allows more efficient scaling of 64-bit values in various
parts of the kernel.  GCC advertises __int128 support with the
__INT128__ #define, but we care about efficient inline
support, so this is a separate flag.

For s390, that was added on 24 March 2017 by
https://gcc.gnu.org/viewcvs/gcc?view=revision=246457
which is part of GCC 7.

It also only applies to TARGET_ARCH12, which I am guessing
corresponds to HAVE_MARCH_ZEC12_FEATURES.  clang support is
pure guesswork.

Signed-off-by: George Spelvin 
---
 arch/s390/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index ed554b09eb3f..43e6dc677f7d 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -372,6 +372,7 @@ endchoice
 
 config 64BIT
def_bool y
+   select ARCH_SUPPORTS_INT128 if GCC_VERSION >= 7 && 
HAVE_MARCH_ZEC12_FEATURES || CC_IS_CLANG
 
 config COMPAT
def_bool y
-- 
2.20.1



Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?

2019-03-30 Thread Heiko Carstens
On Fri, Mar 29, 2019 at 01:07:07PM +, George Spelvin wrote:
> (Cross-posted in case there are generic issues; please trim if
> discussion wanders into single-architecture details.)
> 
> I was working on some scaling code that can benefit from 64x64->128-bit
> multiplies.  GCC supports an __int128 type on processors with hardware
> support (including z/Arch and MIPS64), but the support was broken on
> early compilers, so it's gated behind CONFIG_ARCH_SUPPORTS_INT128.
> 
> Currently, of the ten 64-bit architectures Linux supports, that's
> only enabled on x86, ARM, and RISC-V.
> 
> SPARC and HP-PA don't have support.
> 
> But that leaves Alpha, Mips, PowerPC, and S/390x.
> 
> Current mips64, powerpc64, and s390x gcc seems to generate sensible code
> for mul_u64_u64_shr() in  if I cross-compile them.
> 
> I don't have easy access to an Alpha cross-compiler to test, but
> as it has UMULH, I suspect it would work, too.
> 
> Is there a reason it hasn't been enabled on these platforms?

It hasn't been enabled on s390 simply because at least I wasn't aware
of this config option. Feel free to send a patch, otherwise I will
enable this. Whatever you prefer.

Thanks for pointing this out!



Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?

2019-03-30 Thread Segher Boessenkool
Hi!

On Fri, Mar 29, 2019 at 01:07:07PM +, George Spelvin wrote:
> I was working on some scaling code that can benefit from 64x64->128-bit
> multiplies.  GCC supports an __int128 type on processors with hardware
> support (including z/Arch and MIPS64), but the support was broken on
> early compilers, so it's gated behind CONFIG_ARCH_SUPPORTS_INT128.
> 
> Currently, of the ten 64-bit architectures Linux supports, that's
> only enabled on x86, ARM, and RISC-V.
> 
> SPARC and HP-PA don't have support.
> 
> But that leaves Alpha, Mips, PowerPC, and S/390x.
> 
> Current mips64, powerpc64, and s390x gcc seems to generate sensible code
> for mul_u64_u64_shr() in  if I cross-compile them.

Yup.

> I don't have easy access to an Alpha cross-compiler to test, but
> as it has UMULH, I suspect it would work, too.

https://mirrors.edge.kernel.org/pub/tools/crosstool/

> u64 get_random_u64(void);
> u64 get_random_max64(u64 range, u64 lim)
> {
>   unsigned __int128 prod;
>   do {
>   prod = (unsigned __int128)get_random_u64() * range;
>   } while (unlikely((u64)prod < lim));
>   return prod >> 64;
> }

> Which turns into these inner loops:
> MIPS:
> .L7:
>   jal get_random_u64
>   nop
>   dmultu $2,$17
>   mflo$3
>   sltu$4,$3,$16
>   bne $4,$0,.L7
>   mfhi$2
> 
> PowerPC:
> .L9:
>   bl get_random_u64
>   nop
>   mulld 9,3,31
>   mulhdu 3,3,31
>   cmpld 7,30,9
>   bgt 7,.L9
> 
> s/390:
> .L13:
>   brasl   %r14,get_random_u64@PLT
>   lgr %r5,%r2
>   mlgr%r4,%r10
>   lgr %r2,%r4
>   clgr%r11,%r5
>   jh  .L13
> 
> I like that the MIPS code leaves the high half of the product in
> the hi register until it tests the low half; I wish PowerPC would
> similarly move the mulhdu *after* the loop,

The MIPS code has the multiplication inside the loop as well, and even
the mfhi I think: MIPS has delay slots.

GCC treats the int128 as one register until it has expanded to RTL, and it
does not do such loop optimisations after that, apparently.

File a PR please?  https://gcc.gnu.org/bugzilla/


Segher


Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?

2019-03-29 Thread Michael Cree
On Fri, Mar 29, 2019 at 01:07:07PM +, George Spelvin wrote:
> I was working on some scaling code that can benefit from 64x64->128-bit
> multiplies.  GCC supports an __int128 type on processors with hardware
> support (including z/Arch and MIPS64), but the support was broken on
> early compilers, so it's gated behind CONFIG_ARCH_SUPPORTS_INT128.
[snip] 
> I don't have easy access to an Alpha cross-compiler to test, but
> as it has UMULH, I suspect it would work, too.

On Debian/Ubuntu it is just a matter of:
apt-get install gcc-alpha-linux-gnu

> Or this handwritten Alpha code:
> 1:
>   bsr $26, get_random_u64
>   mulq$0, $9, $1  # $9 is range
>   cmpult  $1, $10, $1 # $10 is lim
>   bne $1, 1b
>   umulh   $0, $9, $0

The compiler produces:

$L2:
ldq $27,get_random_u64($29) !literal!2
jsr $26,($27),get_random_u64!lituse_jsr!2
ldah $29,0($26) !gpdisp!3
mulq $0,$9,$1
lda $29,0($29)  !gpdisp!3
umulh $0,$9,$0
cmpule $10,$1,$1
beq $1,$L2

It does move the umulh inside the loop but that seems sensible since
the use of unlikely() implies that the loop is unlikely to be taken
so on average it would be a good bet to start the calculation of
umulh earlier since it has a few cycles latency to get the result,
and it is pipelined so it can be calculated in the shadow of the
mulq instruction on the same execution unit.  On the older CPUs
(before EV6 which are not out-of-order execution) having the umulh
inside the loop may be a net gain.

Cheers,
Michael.


CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?

2019-03-29 Thread George Spelvin
(Cross-posted in case there are generic issues; please trim if
discussion wanders into single-architecture details.)

I was working on some scaling code that can benefit from 64x64->128-bit
multiplies.  GCC supports an __int128 type on processors with hardware
support (including z/Arch and MIPS64), but the support was broken on
early compilers, so it's gated behind CONFIG_ARCH_SUPPORTS_INT128.

Currently, of the ten 64-bit architectures Linux supports, that's
only enabled on x86, ARM, and RISC-V.

SPARC and HP-PA don't have support.

But that leaves Alpha, Mips, PowerPC, and S/390x.

Current mips64, powerpc64, and s390x gcc seems to generate sensible code
for mul_u64_u64_shr() in  if I cross-compile them.

I don't have easy access to an Alpha cross-compiler to test, but
as it has UMULH, I suspect it would work, too.

Is there a reason it hasn't been enabled on these platforms?

There might be a MIPS64r6 issue, since r6 changed from DMULTU
writing the lo and hi registers to DMULU/DMUHU, and gcc 8.3, at
least, doesn't know how to generate inline code for the latter.

(Note that users *also* check __INT128__, which is defined if GCC
claims to support __int128, so you don't have to worry about 32-bit
compiles or ancient compilers.  It only has to be conditional on
*broken* support.)


FWIW, the code I'm working on has this inner loop:
(https://arxiv.org/abs/1805.10941 for details)

u64 get_random_u64(void);
u64 get_random_max64(u64 range, u64 lim)
{
unsigned __int128 prod;
do {
prod = (unsigned __int128)get_random_u64() * range;
} while (unlikely((u64)prod < lim));
return prod >> 64;
}

Which turns into these inner loops:
MIPS:
.L7:
jal get_random_u64
nop
dmultu $2,$17
mflo$3
sltu$4,$3,$16
bne $4,$0,.L7
mfhi$2

PowerPC:
.L9:
bl get_random_u64
nop
mulld 9,3,31
mulhdu 3,3,31
cmpld 7,30,9
bgt 7,.L9

s/390:
.L13:
brasl   %r14,get_random_u64@PLT
lgr %r5,%r2
mlgr%r4,%r10
lgr %r2,%r4
clgr%r11,%r5
jh  .L13

I like that the MIPS code leaves the high half of the product in
the hi register until it tests the low half; I wish PowerPC would
similarly move the mulhdu *after* the loop, like the following
hypothetical MIPS R6 code:

.L7:
balcget_random_u64
dmulu   $3, $2, $17
sltu$3, $3, $16
bnezc   $3, .L7
dmuhu   $2, $2, $17

Or this handwritten Alpha code:
1:
bsr $26, get_random_u64
mulq$0, $9, $1  # $9 is range
cmpult  $1, $10, $1 # $10 is lim
bne $1, 1b
umulh   $0, $9, $0