Re: [patch, libfortran] AMD-specific versions of library matmul

2017-05-26 Thread Bill Seurer

On 05/26/2017 12:41 AM, Andrew Pinski wrote:

On Thu, May 25, 2017 at 6:43 PM, Jerry DeLisle  wrote:

On 05/25/2017 02:57 PM, Thomas Koenig wrote:


Hi everybody,

I have committed the patch (with the corrections for the name)
as rev 248472.

The infrastructure is in place, so we will be able to make
any fine-tuning easily.

Regards

 Thomas



Based on my testing I think it is close enough as is.


This patch most likely broke all non-x86 targets:
configure: error: conditional "HAVE_AVX128" was never defined.
Usually this means the macro was only invoked conditionally.
Makefile:19843: recipe for target 'configure-target-libgfortran' failed
make[1]: *** [configure-target-libgfortran] Error 1
make[1]: *** Waiting for unfinished jobs


Yup, this is definitely what broke (almost) everything.  Rev 248471 worked
fine, and the above error started with 248472.

--

-Bill Seurer



Fortran bootstrap failure (was [patch, libfortran] AMD-specific versions of library matmul)

2017-05-26 Thread Wilco Dijkstra
Hi,

This patch most likely broke all non-x86 targets:
configure: error: conditional "HAVE_AVX128" was never defined.
Usually this means the macro was only invoked conditionally.
Makefile:19843: recipe for target 'configure-target-libgfortran' failed
make[1]: *** [configure-target-libgfortran] Error 1
make[1]: *** Waiting for unfinished jobs

I've created PR80889 for this. It would be good to either fix or revert this
ASAP, as otherwise we lose all testing for the next 3 days too...

Wilco

Re: [patch, libfortran] AMD-specific versions of library matmul

2017-05-25 Thread Andrew Pinski
On Thu, May 25, 2017 at 6:43 PM, Jerry DeLisle  wrote:
> On 05/25/2017 02:57 PM, Thomas Koenig wrote:
>>
>> Hi everybody,
>>
>> I have committed the patch (with the corrections for the name)
>> as rev 248472.
>>
>> The infrastructure is in place, so we will be able to make
>> any fine-tuning easily.
>>
>> Regards
>>
>>  Thomas
>
>
> Based on my testing I think it is close enough as is.

This patch most likely broke all non-x86 targets:
configure: error: conditional "HAVE_AVX128" was never defined.
Usually this means the macro was only invoked conditionally.
Makefile:19843: recipe for target 'configure-target-libgfortran' failed
make[1]: *** [configure-target-libgfortran] Error 1
make[1]: *** Waiting for unfinished jobs


Thanks,
Andrew

>
> Thanks Thomas
>
> Jerry


Re: [patch, libfortran] AMD-specific versions of library matmul

2017-05-25 Thread Jerry DeLisle

On 05/25/2017 02:57 PM, Thomas Koenig wrote:

Hi everybody,

I have committed the patch (with the corrections for the name)
as rev 248472.

The infrastructure is in place, so we will be able to make
any fine-tuning easily.

Regards

 Thomas


Based on my testing I think it is close enough as is.

Thanks Thomas

Jerry


Re: [patch, libfortran] AMD-specific versions of library matmul

2017-05-25 Thread Thomas Koenig

Hi everybody,

I have committed the patch (with the corrections for the name)
as rev 248472.

The infrastructure is in place, so we will be able to make
any fine-tuning easily.

Regards

Thomas


Re: [patch, libfortran] AMD-specific versions of library matmul

2017-05-25 Thread Jerry DeLisle

On 05/25/2017 10:20 AM, Janne Blomqvist wrote:

On Thu, May 25, 2017 at 1:45 PM, Thomas Koenig  wrote:

Hello world,

the attached patch speeds up the library version of matmul for AMD chips
by selecting AVX128 instructions and, depending on which instructions
are supported, either FMA3 (aka FMA) or FMA4.

Jerry tested this on his AMD systems, and found a speedup vs. the
current code of around 10%.

I have been unable to test this on a Ryzen system (the new compile farm
machines won't accept my login yet).  From the benchmarks I have read,
this method should also work fairly well on a Ryzen.

So, OK for trunk?


In some comments, you have -mprefer=avx128 whereas the option that gcc
understands is -mprefer-avx128. Also, have you verified that e.g.
contemporary Intel processors still use the avx256 codepath and don't
accidentally end up with avx128?

As for FMA4, are there enough processors around that support FMA4 but
not FMA3 to justify bloating the library for them? As I understand it,
that is only a single AMD CPU generation ("bulldozer" in 2011); the
next one ("piledriver" in 2012) added FMA3 in addition to FMA4. In the
new Zen core (Ryzen, Epyc, etc.) AMD has dropped support for FMA4.
There are reports that it will still execute FMA4 for backward
compatibility even though it is no longer advertised in CPUID, but in
any case AMD seems to consider it a legacy instruction that should not
be used anymore (Intel never supported it).



Good questions. I am testing this on Ryzen now, and it does work as advertised.
The CPU flags only advertise FMA.


So I will test the older AMD machine, which advertises both FMA4 and FMA, with
just the FMA flag, and likewise the Ryzen with both FMA4 and FMA.


I want to see if there is any breakage between the two generations of AMD I can 
access.


Also Ryzen with and without -mprefer-avx128 will be tested.

I do not have an Intel box to test.

Regards,

Jerry




Re: [patch, libfortran] AMD-specific versions of library matmul

2017-05-25 Thread Janne Blomqvist
On Thu, May 25, 2017 at 1:45 PM, Thomas Koenig  wrote:
> Hello world,
>
> the attached patch speeds up the library version of matmul for AMD chips
> by selecting AVX128 instructions and, depending on which instructions
> are supported, either FMA3 (aka FMA) or FMA4.
>
> Jerry tested this on his AMD systems, and found a speedup vs. the
> current code of around 10%.
>
> I have been unable to test this on a Ryzen system (the new compile farm
> machines won't accept my login yet).  From the benchmarks I have read,
> this method should also work fairly well on a Ryzen.
>
> So, OK for trunk?

In some comments, you have -mprefer=avx128 whereas the option that gcc
understands is -mprefer-avx128. Also, have you verified that e.g.
contemporary Intel processors still use the avx256 codepath and don't
accidentally end up with avx128?

As for FMA4, are there enough processors around that support FMA4 but
not FMA3 to justify bloating the library for them? As I understand it,
that is only a single AMD CPU generation ("bulldozer" in 2011); the
next one ("piledriver" in 2012) added FMA3 in addition to FMA4. In the
new Zen core (Ryzen, Epyc, etc.) AMD has dropped support for FMA4.
There are reports that it will still execute FMA4 for backward
compatibility even though it is no longer advertised in CPUID, but in
any case AMD seems to consider it a legacy instruction that should not
be used anymore (Intel never supported it).


-- 
Janne Blomqvist


Re: [patch, libfortran] AMD-specific versions of library matmul

2017-05-25 Thread Thomas Koenig

Hi Jerry,


> Yes, OK.  Maybe test Ryzen first?

Sure, I can wait for a bit :-)

> I just confirmed access to the Ryzen machines so I plan to get set up
> and test there.

The gcc compile farm machines?  My ssh key does not work there...

I have based the choice of FMA(3) over FMA4 when both are available
on a short remark in a benchmark that FMA3 is faster... it might
be interesting to see if that is actually true.
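
The preference described above could be written as a small pure function.
This is only an illustrative sketch: the enum and function names are
hypothetical and not taken from the patch, and the actual selection logic
in m4/matmul.m4 may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical kernel identifiers; the names are illustrative only.  */
typedef enum
{
  KERNEL_GENERIC,
  KERNEL_AVX128_FMA3,
  KERNEL_AVX128_FMA4
} matmul_kernel;

/* Choose a 128-bit AVX kernel from three feature bits.
   FMA3 is preferred whenever both FMA3 and FMA4 are available;
   without AVX, or without any FMA variant, fall back to generic.  */
matmul_kernel
choose_amd_kernel (bool has_avx, bool has_fma3, bool has_fma4)
{
  if (!has_avx)
    return KERNEL_GENERIC;
  if (has_fma3)
    return KERNEL_AVX128_FMA3;
  if (has_fma4)
    return KERNEL_AVX128_FMA4;
  return KERNEL_GENERIC;
}
```

Keeping the policy separate from the CPU query would make it easy to flip
the FMA3/FMA4 preference if measurements show the benchmark remark does
not hold.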

Regards

Thomas


Re: [patch, libfortran] AMD-specific versions of library matmul

2017-05-25 Thread Jerry DeLisle

On 05/25/2017 03:45 AM, Thomas Koenig wrote:

Hello world,

the attached patch speeds up the library version of matmul for AMD chips
by selecting AVX128 instructions and, depending on which instructions
are supported, either FMA3 (aka FMA) or FMA4.

Jerry tested this on his AMD systems, and found a speedup vs. the
current code of around 10%.

I have been unable to test this on a Ryzen system (the new compile farm
machines won't accept my login yet).  From the benchmarks I have read,
this method should also work fairly well on a Ryzen.

So, OK for trunk?


Yes, OK.  Maybe test Ryzen first?

I just confirmed access to the Ryzen machines so I plan to get set up and test 
there.


Time to start looking under the hood.

cat /proc/cpuinfo gives the following flags:

flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 
clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 
constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni 
pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c 
rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 
3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext 
perfctr_l2 mwaitx hw_pstate vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap 
clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf arat npt lbrv 
svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter 
pfthreshold avic overflow_recov succor smca
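
When checking a flags line like this programmatically, whole-token
matching matters: a plain substring search for "fma" would also match
"fma4", and "avx" would match "avx2". A minimal sketch in C, using the
hypothetical helper name has_cpu_flag and assuming the flags are
separated by spaces, tabs or newlines:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True if c separates tokens (or ends the string).  */
static bool
is_separator (char c)
{
  return c == '\0' || c == ' ' || c == '\t' || c == '\n';
}

/* Return true if `flag` occurs in `flags` as a complete token,
   so that "fma" does not match inside "fma4".  */
bool
has_cpu_flag (const char *flags, const char *flag)
{
  assert (flags != NULL && flag != NULL);
  size_t len = strlen (flag);
  if (len == 0)
    return false;

  const char *p = flags;
  while ((p = strstr (p, flag)) != NULL)
    {
      bool starts_token = (p == flags) || is_separator (p[-1]);
      bool ends_token = is_separator (p[len]);
      if (starts_token && ends_token)
        return true;
      p++;  /* Keep searching after this partial match.  */
    }
  return false;
}
```

Applied to the listing above, has_cpu_flag (flags, "fma") would be true
and has_cpu_flag (flags, "fma4") false, since fma4 does not appear in it.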




Fwd: [patch, libfortran] AMD-specific versions of library matmul

2017-05-25 Thread Thomas Koenig

Hi,

patch is at https://gcc.gnu.org/ml/fortran/2017-05/msg00133.html
(it didn't go through to gcc-patches due to size limitations).

Regards

Thomas


 Forwarded Message 
Subject: [patch, libfortran] AMD-specific versions of library matmul
Date: Thu, 25 May 2017 12:45:46 +0200
From: Thomas Koenig <tkoe...@netcologne.de>
To: fort...@gcc.gnu.org <fort...@gcc.gnu.org>, gcc-patches 
<gcc-patches@gcc.gnu.org>


Hello world,

the attached patch speeds up the library version of matmul for AMD chips
by selecting AVX128 instructions and, depending on which instructions
are supported, either FMA3 (aka FMA) or FMA4.
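
As a rough illustration of how such a run-time check can look, the
sketch below queries the CPU with GCC's x86 builtins
(__builtin_cpu_init, __builtin_cpu_is, __builtin_cpu_supports). It is
not taken from the patch: the struct and function names are
hypothetical, and the code in m4/matmul.m4 may obtain the same
information differently. The preprocessor guard keeps it compiling on
non-x86 targets, where every field simply stays false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical feature record; field names are illustrative only.  */
struct cpu_features
{
  bool is_amd;
  bool avx;
  bool fma3;
  bool fma4;
};

/* Fill *f from the running CPU.  On non-x86 targets, or with a
   compiler lacking the GCC builtins, all fields remain false.  */
void
query_cpu_features (struct cpu_features *f)
{
  assert (f != NULL);
  f->is_amd = f->avx = f->fma3 = f->fma4 = false;
#if defined (__GNUC__) && (defined (__x86_64__) || defined (__i386__))
  __builtin_cpu_init ();
  f->is_amd = __builtin_cpu_is ("amd") != 0;
  f->avx = __builtin_cpu_supports ("avx") != 0;
  f->fma3 = __builtin_cpu_supports ("fma") != 0;
  f->fma4 = __builtin_cpu_supports ("fma4") != 0;
#endif
}
```

A caller would fill the record once and then pick the AVX128 variant
only when is_amd and avx are set, using fma3 or fma4 to choose between
the two FMA flavours.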

Jerry tested this on his AMD systems, and found a speedup vs. the
current code of around 10%.

I have been unable to test this on a Ryzen system (the new compile farm
machines won't accept my login yet).  From the benchmarks I have read,
this method should also work fairly well on a Ryzen.

So, OK for trunk?

Regards

Thomas

2017-05-25  Thomas Koenig  <tkoe...@gcc.gnu.org>

PR libfortran/78379
* Makefile.am: Add generated/matmulavx128_*.c files.
Handle them for compiling and setting the right flags.
* acinclude.m4: Add tests for FMA3, FMA4 and AVX128.
* configure.ac: Call them.
* Makefile.in: Regenerated.
* config.h.in: Regenerated.
* configure: Regenerated.
* m4/matmul.m4: Handle AMD chips by calling 128-bit AVX
versions which use FMA3 or FMA4.
* m4/matmulavx128.m4: New file.
* generated/matmul_c10.c: Regenerated.
* generated/matmul_c16.c: Regenerated.
* generated/matmul_c4.c: Regenerated.
* generated/matmul_c8.c: Regenerated.
* generated/matmul_i1.c: Regenerated.
* generated/matmul_i16.c: Regenerated.
* generated/matmul_i2.c: Regenerated.
* generated/matmul_i4.c: Regenerated.
* generated/matmul_i8.c: Regenerated.
* generated/matmul_r10.c: Regenerated.
* generated/matmul_r16.c: Regenerated.
* generated/matmul_r4.c: Regenerated.
* generated/matmul_r8.c: Regenerated.
* generated/matmulavx128_c10.c: New file.
* generated/matmulavx128_c16.c: New file.
* generated/matmulavx128_c4.c: New file.
* generated/matmulavx128_c8.c: New file.
* generated/matmulavx128_i1.c: New file.
* generated/matmulavx128_i16.c: New file.
* generated/matmulavx128_i2.c: New file.
* generated/matmulavx128_i4.c: New file.
* generated/matmulavx128_i8.c: New file.
* generated/matmulavx128_r10.c: New file.
* generated/matmulavx128_r16.c: New file.
* generated/matmulavx128_r4.c: New file.
* generated/matmulavx128_r8.c: New file.