Re: [patch, libfortran] Add AVX-specific matmul

2016-11-27 Thread Jerry DeLisle

On 11/27/2016 08:50 AM, Thomas Koenig wrote:

Hello world,

here is another, much revised, update of the AVX-specific matmul patch.

The processor-specific switching is now done directly, using the


--- snip ---

This comment is not right:

+/* Put exhaustive list of possible architectures here here, ORed together.  */

Performs as expected on my AMD machines. We can still improve peak performance 
on these by about 7%. To clarify, these chips require -mavx -mprefer-avx128. So 
what we need to do is sort out which AMD CPUs need this adjustment with AVX 
registers. (A later patch)


I would suggest naming this file matmul_base.m4 rather than
matmul_internal.m4, but this is not critical.


Need a libgcc person for the changes to the cpuinfo items.

The libgfortran portions look OK.

Jerry



Re: [patch, libfortran] Add AVX-specific matmul

2016-11-27 Thread Thomas Koenig

I wrote:


As an added bonus, I added some m4 hacks to disable both
AVX and AVX2 code generation for REAL.


This should have read "I added some m4 hacks to disable
the AVX2 code generation for REAL."

Regards

Thomas


Re: [patch, libfortran] Add AVX-specific matmul

2016-11-27 Thread Thomas Koenig

Hello world,

here is another, much revised, update of the AVX-specific matmul patch.

The processor-specific switching is now done directly, using the
machinery from libgcc. For this, I have moved information from
the i386-specific cpuinfo.c file to a new header file cpuinfo.h,
which is then accessed from the matmul function to select the
correct version for the detected CPU.

For matmul itself, the workhorse function was put into its
own file, which is then included multiple times with
name and target attributes set correctly.
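
As a rough illustration of that scheme (not the code from the patch), the sketch below instantiates one loop body twice under different names and target attributes and picks one at run time. It makes several assumptions: a macro stands in for the separately included file so that the example fits in one translation unit, the GCC built-ins `__builtin_cpu_init` and `__builtin_cpu_supports` stand in for the cpuinfo.h machinery, and the names `DEFINE_DOT`, `dot_avx`, `dot_vanilla`, `select_dot` and `dot` are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the body that the patch keeps in its own file: a plain
   dot product.  NAME and ATTR are supplied per instantiation.  */
#define DEFINE_DOT(NAME, ATTR)                                  \
  ATTR static double NAME (const double *x, const double *y,    \
                           size_t n)                            \
  {                                                             \
    double sum = 0.0;                                           \
    for (size_t i = 0; i < n; i++)                              \
      sum += x[i] * y[i];                                       \
    return sum;                                                 \
  }

#if defined(__x86_64__) || defined(__i386__)
/* Same source, compiled with AVX enabled for this function only.  */
DEFINE_DOT (dot_avx, __attribute__ ((target ("avx"))))
#endif
/* Same source, compiled with the default flags.  */
DEFINE_DOT (dot_vanilla, )

typedef double (*dot_fn) (const double *, const double *, size_t);

/* Choose an implementation based on the CPU we are running on.  */
static dot_fn
select_dot (void)
{
#if defined(__x86_64__) || defined(__i386__)
  __builtin_cpu_init ();
  if (__builtin_cpu_supports ("avx"))
    return dot_avx;
#endif
  return dot_vanilla;
}

/* Public entry point.  The choice is cached after the first call;
   this simple cache is not thread-safe as written.  */
double
dot (const double *x, const double *y, size_t n)
{
  static dot_fn impl;
  assert (x != NULL && y != NULL);
  if (impl == NULL)
    impl = select_dot ();
  return impl (x, y, n);
}
```

Callers only ever see `dot`; adding another instruction set would mean one more `DEFINE_DOT` line and one more test in `select_dot`.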

So far, this patch is Intel only.  Jerry's benchmarks indicated
that AVX is actually slower on AMD chips.  Some googling reveals
that other people have had similar experiences.

Using AVX128 for AMD processors would be somewhat beneficial,
but that currently cannot be specified as a target attribute.
I'll leave that for later.
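
A minimal sketch of such an "Intel only" policy, assuming GCC's `__builtin_cpu_is` and `__builtin_cpu_supports` built-ins rather than the cpuinfo.h structures the patch uses. The function names `want_avx` and `use_avx_variant` are hypothetical.

```c
#include <stdbool.h>

/* Policy, kept separate from detection so it can be checked on any
   machine: use the 256-bit AVX variant only on Intel CPUs.  */
bool
want_avx (bool is_intel, bool has_avx)
{
  return is_intel && has_avx;
}

/* Detection: feed the policy with what the running CPU reports.  */
bool
use_avx_variant (void)
{
#if defined(__x86_64__) || defined(__i386__)
  __builtin_cpu_init ();
  return want_avx (__builtin_cpu_is ("intel") != 0,
                   __builtin_cpu_supports ("avx") != 0);
#else
  return false;
#endif
}
```

An AMD-specific AVX128 variant would need a third branch in the policy once it can be requested per function.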

As an added bonus, I added some m4 hacks to disable both
AVX and AVX2 code generation for REAL.

So, what do you think?  Is this the right way forward, especially
regarding the CPU detection part?

Regards

Thomas

2016-11-27  Thomas Koenig  

PR fortran/78379
* config/i386/cpuinfo.c:  Move enums for processor vendors,
processor type, processor subtypes and declaration of
struct __processor_model into
* config/i386/cpuinfo.h:  New header file.
* Makefile.am:  Add dependence of m4/matmul_internal.m4 to
matmul files.
* Makefile.in:  Regenerated.
* acinclude.m4:  Check for AVX, AVX2 and AVX512F.
* config.h.in:  Add HAVE_AVX, HAVE_AVX2 and HAVE_AVX512F.
* configure:  Regenerated.
* configure.ac:  Use checks for AVX, AVX2 and AVX512F.
* m4/matmul_internal.m4:  New file, working part of matmul.m4.
* m4/matmul.m4:  Implement architecture-specific switching
for AVX, AVX2 and AVX512F by including matmul_internal.m4
multiple times.
* generated/matmul_c10.c: Regenerated.
* generated/matmul_c16.c: Regenerated.
* generated/matmul_c4.c: Regenerated.
* generated/matmul_c8.c: Regenerated.
* generated/matmul_i1.c: Regenerated.
* generated/matmul_i16.c: Regenerated.
* generated/matmul_i2.c: Regenerated.
* generated/matmul_i4.c: Regenerated.
* generated/matmul_i8.c: Regenerated.
* generated/matmul_r10.c: Regenerated.
* generated/matmul_r16.c: Regenerated.
* generated/matmul_r4.c: Regenerated.
* generated/matmul_r8.c: Regenerated.


[Full patch at https://gcc.gnu.org/ml/fortran/2016-11/msg00246.html ,
this was rejected for reasons of size]



Re: [patch, libfortran] Add AVX-specific matmul

2016-11-17 Thread Jakub Jelinek
On Thu, Nov 17, 2016 at 08:41:48AM +0100, Thomas Koenig wrote:
> Am 17.11.2016 um 00:20 schrieb Jakub Jelinek:
> >On Thu, Nov 17, 2016 at 12:03:18AM +0100, Thomas Koenig wrote:
> >>>Don't you need to test in configure if the assembler supports AVX?
> >>>Otherwise if somebody is bootstrapping gcc with older assembler, it will
> >>>just fail to bootstrap.
> >>
> >>That's a good point.  The AVX instructions were added in binutils 2.19,
> >>which was released in 2011. This could be put in the prerequisites.
> >>
> >>What should the test do?  Fail with an error message "you need newer
> >>binutils" or simply (and silently) not compile the AVX version?
> >
> >From what I understood, you want those functions just to be implementation
> >details, not exported from libgfortran.so*.  Thus the test would do
> >something similar to what gcc/testsuite/lib/target-supports.exp 
> >(check_effective_target_avx)
> >does, but of course in autoconf way, not in tcl.
> 
> OK, that looks straightforward enough. I'll give it a shot.
> 
> >Also, from what I see, target_clones just use IFUNCs, so you probably also
> >need some configure test whether ifuncs are supported (the
> >gcc.target/i386/mvc* tests use dg-require-ifunc, so you'd need something
> >similar again in configure.  But if so, then I have no idea why you use
> >a wrapper around the function, instead of using it on the exported APIs.
> 
> As you wrote above, I wanted this as an implementation detail. I also
> wanted the ability to be able to add new instruction sets without
> breaking the ABI.

But even exported IFUNC is an implementation detail.  For other
libraries/binaries IFUNC symbol is like any other symbol, they will have
SHN_UNDEF symbol pointing to that, and it matters only for the dynamic
linker during relocation processing.  Whether some function is IFUNC or not
is not an ABI change, you can change at any time a normal function into
IFUNC or vice versa, without breaking ABI.

> You're right - integer multiplication looks different.
> 
> Nobody I know cares about integer matrix multiplication
> speed, whereas real has gotten a _lot_ of attention over
> the decades.  So, putting in AVX will make the code run
> faster on more machines, while putting in AVX2 will
> (IMHO) bloat the library for no good reason.  However,
> I am willing to stand corrected on this. Putting in AVX512f
> makes sense.

Which is why I've been proposing to use avx2,default for the
matmul_i* files and avx,default for the others.
avx will not buy much for matmul_i*, while avx2 will.
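
For illustration only, an integer kernel with an avx2 clone could look roughly like the sketch below. The function name `matmul_i4_sketch` and the `INT_CLONES` macro are made up, the loop is a plain triple loop rather than the library's algorithm, and the guard assumes an x86 GNU/Linux target with IFUNC support.

```c
#include <stddef.h>
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__gnu_linux__)
/* Build an AVX2 copy and a default copy; the compiler emits the
   IFUNC-based dispatch between them.  */
# define INT_CLONES __attribute__ ((target_clones ("avx2", "default")))
#else
# define INT_CLONES
#endif

/* c = a * b for n x n row-major matrices of 32-bit integers.  */
INT_CLONES void
matmul_i4_sketch (int32_t *c, const int32_t *a, const int32_t *b, size_t n)
{
  for (size_t i = 0; i < n; i++)
    for (size_t j = 0; j < n; j++)
      {
        int32_t sum = 0;
        for (size_t k = 0; k < n; k++)
          sum += a[i * n + k] * b[k * n + j];
        c[i * n + j] = sum;
      }
}
```

For the real and complex files the clone list would read "avx", "default" instead.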

> I have also been trying to get target_clones to work on POWER
> to get Altivec instructions, but to no avail. I also cannot
> find any examples in the testsuite.

Haven't checked, but maybe the target_clones attribute has been only
implemented for x86_64/i686 and not for other targets.
But power supports target attribute, so you e.g. have the option of
#including the routine multiple times in one TU, each time with different
name and target attribute, and then write the IFUNC routine for it by hand.
Or attempt to support target_clones on power, or ask power maintainers
to do that.
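
A minimal sketch of the hand-written variant, shown for x86 because that is where `__builtin_cpu_supports` is available to me; a POWER version would need its own feature test. All names (`DEFINE_SCALE`, `scale_avx`, `scale_default`, `resolve_scale`, `scale`) are hypothetical, and a macro stands in for the repeated #include.

```c
#include <stddef.h>

typedef void (*scale_fn) (double *, size_t, double);

/* One routine body, instantiated under several names and targets.  */
#define DEFINE_SCALE(NAME, ATTR)                          \
  ATTR static void NAME (double *x, size_t n, double f)   \
  {                                                       \
    for (size_t i = 0; i < n; i++)                        \
      x[i] *= f;                                          \
  }

DEFINE_SCALE (scale_default, )

#if (defined(__x86_64__) || defined(__i386__)) && defined(__gnu_linux__)
DEFINE_SCALE (scale_avx, __attribute__ ((target ("avx"))))

/* Resolver: called by the dynamic linker during relocation
   processing; returns the implementation to bind "scale" to.  */
static scale_fn
resolve_scale (void)
{
  __builtin_cpu_init ();
  return __builtin_cpu_supports ("avx") ? scale_avx : scale_default;
}

/* The exported symbol is an IFUNC; callers see a normal function.  */
void scale (double *x, size_t n, double f)
  __attribute__ ((ifunc ("resolve_scale")));
#else
/* No IFUNC assumed here: plain function calling the default body.  */
void
scale (double *x, size_t n, double f)
{
  scale_default (x, n, f);
}
#endif
```

Both branches export the same `scale` symbol with the same signature, which is the point made above about IFUNC not being an ABI change.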

Jakub


Re: [patch, libfortran] Add AVX-specific matmul

2016-11-17 Thread Thomas Koenig

Well, here is a newer version of the patch.

I wrote a few configure tests to check for AVX.
This version has the advantage that, if anybody
uses 32-bit programs with AVX, they would also benefit.

Jakub, would you be OK with that patch?

I do not yet want to commit this because it needs more
testing on different platforms to see if it actually
performs better.

Regarding putting the blocked part into something separate:
Quite doable, but I would rather like to do this in a follow-up
patch, if we decide to do it.

Regards

Thomas

2016-11-17  Thomas Koenig  

PR fortran/78379
* acinclude.m4 (LIBGFOR_CHECK_AVX):  New test.
(LIBGFOR_CHECK_AVX2):  New test.
(LIBGFOR_CHECK_AVX512F):  New test.
* configure.ac:  Call LIBGFOR_CHECK_AVX, LIBGFOR_CHECK_AVX2
and LIBGFOR_CHECK_AVX512F.
* config.h.in: Regenerated.
* configure:  Regenerated.
* m4/matmul.m4: For AVX, AVX2 and AVX512F, make the work function
for matmul static with target_clones for AVX and default, and
create a wrapper function to call it.
* generated/matmul_c10.c: Regenerated.
* generated/matmul_c16.c: Regenerated.
* generated/matmul_c4.c: Regenerated.
* generated/matmul_c8.c: Regenerated.
* generated/matmul_i1.c: Regenerated.
* generated/matmul_i16.c: Regenerated.
* generated/matmul_i2.c: Regenerated.
* generated/matmul_i4.c: Regenerated.
* generated/matmul_i8.c: Regenerated.
* generated/matmul_r10.c: Regenerated.
* generated/matmul_r16.c: Regenerated.
* generated/matmul_r4.c: Regenerated.
* generated/matmul_r8.c: Regenerated.


Index: acinclude.m4
===
--- acinclude.m4	(Revision 242477)
+++ acinclude.m4	(Arbeitskopie)
@@ -393,3 +393,54 @@ AC_DEFUN([LIBGFOR_CHECK_STRERROR_R], [
 		  [Define if strerror_r takes two arguments and is available in .]),)
   CFLAGS="$ac_save_CFLAGS"
 ])
+
+dnl Check for AVX
+
+AC_DEFUN([LIBGFOR_CHECK_AVX], [
+  ac_save_CFLAGS="$CFLAGS"
+  CFLAGS="-O2 -mavx"
+  AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
+  void _mm256_zeroall (void)
+{
+   __builtin_ia32_vzeroall ();
+}]], [[]])],
+	AC_DEFINE(HAVE_AVX, 1,
+	[Define if AVX instructions can be compiled.]),
+	[])
+  CFLAGS="$ac_save_CFLAGS"
+])
+
+dnl Check for AVX2
+
+AC_DEFUN([LIBGFOR_CHECK_AVX2], [
+  ac_save_CFLAGS="$CFLAGS"
+  CFLAGS="-O2 -mavx2"
+  AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
+  typedef long long __v4di __attribute__ ((__vector_size__ (32)));
+	__v4di
+	mm256_is32_andnotsi256  (__v4di __X, __v4di __Y)
+{
+	   return __builtin_ia32_andnotsi256 (__X, __Y);
+}]], [[]])],
+	AC_DEFINE(HAVE_AVX2, 1,
+	[Define if AVX2 instructions can be compiled.]),
+	[])
+  CFLAGS="$ac_save_CFLAGS"
+])
+
+dnl Check for AVX512f
+
+AC_DEFUN([LIBGFOR_CHECK_AVX512F], [
+  ac_save_CFLAGS="$CFLAGS"
+  CFLAGS="-O2 -mavx512f"
+  AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
+	typedef double __m512d __attribute__ ((__vector_size__ (64)));
+	__m512d _mm512_add (__m512d a)
+	{
+	  return __builtin_ia32_addpd512_mask (a, a, a, 1, 4);
+}]], [[]])],
+	AC_DEFINE(HAVE_AVX512F, 1,
+	[Define if AVX512f instructions can be compiled.]),
+	[])
+  CFLAGS="$ac_save_CFLAGS"
+])
Index: config.h.in
===
--- config.h.in	(Revision 242477)
+++ config.h.in	(Arbeitskopie)
@@ -78,6 +78,15 @@
 /* Define to 1 if the target supports __attribute__((visibility(...))). */
 #undef HAVE_ATTRIBUTE_VISIBILITY
 
+/* Define if AVX instructions can be compiled. */
+#undef HAVE_AVX
+
+/* Define if AVX2 instructions can be compiled. */
+#undef HAVE_AVX2
+
+/* Define if AVX512f instructions can be compiled. */
+#undef HAVE_AVX512F
+
 /* Define to 1 if you have the `cabs' function. */
 #undef HAVE_CABS
 
Index: configure
===
--- configure	(Revision 242477)
+++ configure	(Arbeitskopie)
@@ -26174,6 +26174,93 @@ $as_echo "#define HAVE_CRLF 1" >>confdefs.h
 
 fi
 
+# Check whether we support AVX extensions
+
+  ac_save_CFLAGS="$CFLAGS"
+  CFLAGS="-O2 -mavx"
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+  void _mm256_zeroall (void)
+{
+   __builtin_ia32_vzeroall ();
+}
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+
+$as_echo "#define HAVE_AVX 1" >>confdefs.h
+
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+  CFLAGS="$ac_save_CFLAGS"
+
+
+# Check wether we support AVX2 extensions
+
+  ac_save_CFLAGS="$CFLAGS"
+  CFLAGS="-O2 -mavx2"
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+  typedef long long __v4di __attribute__ ((__vector_size__ (32)));
+	__v4di
+	mm256_is32_andnotsi256  (__v4di __X, __v4di __Y)
+{
+	   return __builtin_ia32_andnotsi256 (__X, 

Re: [patch, libfortran] Add AVX-specific matmul

2016-11-16 Thread Janne Blomqvist
On Thu, Nov 17, 2016 at 9:41 AM, Thomas Koenig  wrote:
> Am 17.11.2016 um 00:20 schrieb Jakub Jelinek:
>>
>> On Thu, Nov 17, 2016 at 12:03:18AM +0100, Thomas Koenig wrote:

>>>> Don't you need to test in configure if the assembler supports AVX?
>>>> Otherwise if somebody is bootstrapping gcc with older assembler, it will
>>>> just fail to bootstrap.
>>>
>>>
>>> That's a good point.  The AVX instructions were added in binutils 2.19,
>>> which was released in 2011. This could be put in the prerequisites.
>>>
>>> What should the test do?  Fail with an error message "you need newer
>>> binutils" or simply (and silently) not compile the AVX version?
>>
>>
>> From what I understood, you want those functions just to be implementation
>> details, not exported from libgfortran.so*.  Thus the test would do
>> something similar to what gcc/testsuite/lib/target-supports.exp
>> (check_effective_target_avx)
>> does, but of course in autoconf way, not in tcl.
>
>
> OK, that looks straightforward enough. I'll give it a shot.
>
>> Also, from what I see, target_clones just use IFUNCs, so you probably also
>> need some configure test whether ifuncs are supported (the
>> gcc.target/i386/mvc* tests use dg-require-ifunc, so you'd need something
>> similar again in configure.  But if so, then I have no idea why you use
>> a wrapper around the function, instead of using it on the exported APIs.
>
>
> As you wrote above, I wanted this as an implementation detail. I also
> wanted the ability to be able to add new instruction sets without
> breaking the ABI.
>
> Because the caller generates the ifunc, using a wrapper function seemed
> like the best way to do it.  The overhead is negligible (the function
> is one simple jump), especially considering that we only call the
> library function for larger matrices.
>
>>>> For matmul_i*, wouldn't it make more sense to use avx2 instead of avx,
>>>> or both avx and avx2 and maybe avx512f?
>>>
>>>
>>> I did a vdiff of the disassembled code generated for avx and avx2, and
>>> (somewhat to my surprise) there was no difference.  Maybe, with more
>>> unrolling, something more might have happened. I didn't check for
>>> AVX512f, but I can do that.
>>
>>
>> For the float/double code it wouldn't surprise me (assuming you don't need
>> gather insns and similar stuff).  But for integers generally most of the
>> avx instructions can only handle 128-bit vectors, while avx2 has 256-bit
>> ones,
>
>
> You're right - integer multiplication looks different.
>
> Nobody I know cares about integer matrix multiplication
> speed, whereas real has gotten a _lot_ of attention over
> the decades.  So, putting in AVX will make the code run
> faster on more machines, while putting in AVX2 will
> (IMHO) bloat the library for no good reason.  However,
> I am willing to stand corrected on this. Putting in AVX512f
> makes sense.
>
> I have also been trying to get target_clones to work on POWER
> to get Altivec instructions, but to no avail. I also cannot
> find any examples in the testsuite.
>
> Since a lot of supercomputers use POWER nodes, that might also
> be attractive.
>
> Regards
>
> Thomas

Hi,

In order to reduce bloat, might it make sense to make the core blocked
gemm algorithm that Jerry committed a few days ago into a separate
static function, and then only do the target_clone stuff for that one?
The rest of the matmul function deals with all kinds of stuff like
setup, handling non-stride-1 cases, calling the external gemm function
for -fexternal-blas etc., none of which vectorizes anyway so
generating different versions of this code using different vector
instructions looks like a waste?

In that case I guess one could add the avx2 variant as well on the odd
chance that somebody for some reason cares about integer matmul.
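
A minimal sketch of that split, under stated assumptions: the kernel below is a plain triple loop standing in for the blocked gemm code, the names `gemm_kernel`, `matmul_sketch` and `KERNEL_CLONES` are invented for the example, and the guard assumes an x86 GNU/Linux target where target_clones is usable.

```c
#include <assert.h>
#include <stddef.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__gnu_linux__)
# define KERNEL_CLONES __attribute__ ((target_clones ("avx", "default")))
#else
# define KERNEL_CLONES
#endif

/* Only this stride-1 kernel is cloned per instruction set.
   Computes c += a * b for column-major blocks:
   a is m x k, b is k x n, c is m x n.  */
KERNEL_CLONES static void
gemm_kernel (double *restrict c, const double *restrict a,
             const double *restrict b, size_t m, size_t n, size_t k)
{
  for (size_t j = 0; j < n; j++)
    for (size_t l = 0; l < k; l++)
      {
        const double blj = b[l + j * k];
        for (size_t i = 0; i < m; i++)
          c[i + j * m] += a[i + l * m] * blj;
      }
}

/* Setup and argument handling exist in a single, ordinary copy.  */
void
matmul_sketch (double *c, const double *a, const double *b,
               size_t m, size_t n, size_t k)
{
  assert (a != NULL && b != NULL && c != NULL);
  for (size_t i = 0; i < m * n; i++)
    c[i] = 0.0;
  gemm_kernel (c, a, b, m, n, k);
}
```

With this layout, adding an avx2 or avx512f entry to the clone list duplicates only the kernel, not the setup code.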

-- 
Janne Blomqvist


Re: [patch, libfortran] Add AVX-specific matmul

2016-11-16 Thread Thomas Koenig

Am 17.11.2016 um 00:20 schrieb Jakub Jelinek:

On Thu, Nov 17, 2016 at 12:03:18AM +0100, Thomas Koenig wrote:

Don't you need to test in configure if the assembler supports AVX?
Otherwise if somebody is bootstrapping gcc with older assembler, it will
just fail to bootstrap.


That's a good point.  The AVX instructions were added in binutils 2.19,
which was released in 2011. This could be put in the prerequisites.

What should the test do?  Fail with an error message "you need newer
binutils" or simply (and silently) not compile the AVX version?



From what I understood, you want those functions just to be implementation
details, not exported from libgfortran.so*.  Thus the test would do
something similar to what gcc/testsuite/lib/target-supports.exp 
(check_effective_target_avx)
does, but of course in autoconf way, not in tcl.


OK, that looks straightforward enough. I'll give it a shot.


Also, from what I see, target_clones just use IFUNCs, so you probably also
need some configure test whether ifuncs are supported (the
gcc.target/i386/mvc* tests use dg-require-ifunc, so you'd need something
similar again in configure.  But if so, then I have no idea why you use
a wrapper around the function, instead of using it on the exported APIs.


As you wrote above, I wanted this as an implementation detail. I also
wanted the ability to be able to add new instruction sets without
breaking the ABI.

Because the caller generates the ifunc, using a wrapper function seemed
like the best way to do it.  The overhead is negligible (the function
is one simple jump), especially considering that we only call the
library function for larger matrices.


For matmul_i*, wouldn't it make more sense to use avx2 instead of avx,
or both avx and avx2 and maybe avx512f?


I did a vdiff of the disassembled code generated for avx and avx2, and
(somewhat to my surprise) there was no difference.  Maybe, with more
unrolling, something more might have happened. I didn't check for
AVX512f, but I can do that.


For the float/double code it wouldn't surprise me (assuming you don't need
gather insns and similar stuff).  But for integers generally most of the
avx instructions can only handle 128-bit vectors, while avx2 has 256-bit
ones,


You're right - integer multiplication looks different.

Nobody I know cares about integer matrix multiplication
speed, whereas real has gotten a _lot_ of attention over
the decades.  So, putting in AVX will make the code run
faster on more machines, while putting in AVX2 will
(IMHO) bloat the library for no good reason.  However,
I am willing to stand corrected on this. Putting in AVX512f
makes sense.

I have also been trying to get target_clones to work on POWER
to get Altivec instructions, but to no avail. I also cannot
find any examples in the testsuite.

Since a lot of supercomputers use POWER nodes, that might also
be attractive.

Regards

Thomas


Re: [patch, libfortran] Add AVX-specific matmul

2016-11-16 Thread Jakub Jelinek
On Thu, Nov 17, 2016 at 12:03:18AM +0100, Thomas Koenig wrote:
> >Don't you need to test in configure if the assembler supports AVX?
> >Otherwise if somebody is bootstrapping gcc with older assembler, it will
> >just fail to bootstrap.
> 
> That's a good point.  The AVX instructions were added in binutils 2.19,
> which was released in 2011. This could be put in the prerequisites.
> 
> What should the test do?  Fail with an error message "you need newer
> binutils" or simply (and silently) not compile the AVX version?

From what I understood, you want those functions just to be implementation
details, not exported from libgfortran.so*.  Thus the test would do
something similar to what gcc/testsuite/lib/target-supports.exp 
(check_effective_target_avx)
does, but of course in autoconf way, not in tcl.
Also, from what I see, target_clones just use IFUNCs, so you probably also
need some configure test whether ifuncs are supported (the
gcc.target/i386/mvc* tests use dg-require-ifunc, so you'd need something
similar again in configure.  But if so, then I have no idea why you use
a wrapper around the function, instead of using it on the exported APIs.

> >For matmul_i*, wouldn't it make more sense to use avx2 instead of avx,
> >or both avx and avx2 and maybe avx512f?
> 
> I did a vdiff of the disassembled code generated for avx and avx2, and
> (somewhat to my surprise) there was no difference.  Maybe, with more
> unrolling, something more might have happened. I didn't check for
> AVX512f, but I can do that.

For the float/double code it wouldn't surprise me (assuming you don't need
gather insns and similar stuff).  But for integers generally most of the
avx instructions can only handle 128-bit vectors, while avx2 has 256-bit
ones.

Jakub


Re: [patch, libfortran] Add AVX-specific matmul

2016-11-16 Thread Jerry DeLisle

On 11/16/2016 01:30 PM, Thomas Koenig wrote:

Hello world,

the attached patch adds an AVX-specific version of the matmul
intrinsic to the Fortran library.  This works by using the target_clones
attribute.

For testing, I compiled this on powerpc64-unknown-linux-gnu,
without any ill effects.

Also, a resulting binary reached around 15 GFlops for larger matrices
on a 3.4 GHz i7-2600 CPU.  I am currently building/regtesting on
that machine. This can give another 40% speed increase  for large
matrices on AVX.

OK for trunk?



Did you intend to name it avx_matmul and not aux_matmul?

Are the compiler flags for avx handled automatically by the gcc attributes so no
need to edit the Makefile.am?


Fix the first and if yes to the second question, OK

Jerry







Re: [patch, libfortran] Add AVX-specific matmul

2016-11-16 Thread Thomas Koenig

Am 16.11.2016 um 23:01 schrieb Jakub Jelinek:

On Wed, Nov 16, 2016 at 10:30:03PM +0100, Thomas Koenig wrote:

the attached patch adds an AVX-specific version of the matmul
intrinsic to the Fortran library.  This works by using the target_clones
attribute.


Don't you need to test in configure if the assembler supports AVX?
Otherwise if somebody is bootstrapping gcc with older assembler, it will
just fail to bootstrap.


That's a good point.  The AVX instructions were added in binutils 2.19,
which was released in 2011. This could be put in the prerequisites.

What should the test do?  Fail with an error message "you need newer
binutils" or simply (and silently) not compile the AVX version?


For matmul_i*, wouldn't it make more sense to use avx2 instead of avx,
or both avx and avx2 and maybe avx512f?


I did a vdiff of the disassembled code generated for avx and avx2, and
(somewhat to my surprise) there was no difference.  Maybe, with more
unrolling, something more might have happened. I didn't check for
AVX512f, but I can do that.


2016-11-16  Thomas Koenig  

PR fortran/78379
* m4/matmul.m4:  For x86_64, make the work function for matmul


Why the extra space before For?


Will be removed.


static with target_clones for AVX and default, and create
a wrapper function to call it.
* generated/matmul_c10.c


Missing : Regenerated.


Will be added.

Regards

Thomas


Re: [patch, libfortran] Add AVX-specific matmul

2016-11-16 Thread Jakub Jelinek
On Wed, Nov 16, 2016 at 10:30:03PM +0100, Thomas Koenig wrote:
> the attached patch adds an AVX-specific version of the matmul
> intrinsic to the Fortran library.  This works by using the target_clones
> attribute.

Don't you need to test in configure if the assembler supports AVX?
Otherwise if somebody is bootstrapping gcc with older assembler, it will
just fail to bootstrap.
For matmul_i*, wouldn't it make more sense to use avx2 instead of avx,
or both avx and avx2 and maybe avx512f?

> 2016-11-16  Thomas Koenig  
> 
> PR fortran/78379
> * m4/matmul.m4:  For x86_64, make the work function for matmul

Why the extra space before For?

> static with target_clones for AVX and default, and create
> a wrapper function to call it.
> * generated/matmul_c10.c

Missing : Regenerated.

Jakub


[patch, libfortran] Add AVX-specific matmul

2016-11-16 Thread Thomas Koenig

Hello world,

the attached patch adds an AVX-specific version of the matmul
intrinsic to the Fortran library.  This works by using the target_clones
attribute.

For testing, I compiled this on powerpc64-unknown-linux-gnu,
without any ill effects.

Also, a resulting binary reached around 15 GFlops for larger matrices
on a 3.4 GHz i7-2600 CPU.  I am currently building/regtesting on
that machine. This can give another 40% speed increase  for large
matrices on AVX.
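
For reference, rates like this are conventionally computed by counting one multiply and one add per inner-loop iteration. The helper below is a sketch under that assumption, for real (not complex) data; the name `matmul_gflops` is made up, and the post does not say how the 15 GFlops figure was actually measured.

```c
/* GFlops for an (m x k) by (k x n) real matrix product that took
   "seconds" of wall-clock time, counting 2 * m * n * k operations.  */
double
matmul_gflops (double m, double n, double k, double seconds)
{
  return 2.0 * m * n * k / seconds / 1.0e9;
}
```

Under this convention a 2000 x 2000 product is 1.6e10 operations, so at 15 GFlops it would take roughly 1.07 seconds.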

OK for trunk?

Regards

Thomas

2016-11-16  Thomas Koenig  

PR fortran/78379
* m4/matmul.m4:  For x86_64, make the work function for matmul
static with target_clones for AVX and default, and create
a wrapper function to call it.
* generated/matmul_c10.c
* generated/matmul_c16.c: Regenerated.
* generated/matmul_c4.c: Regenerated.
* generated/matmul_c8.c: Regenerated.
* generated/matmul_i1.c: Regenerated.
* generated/matmul_i16.c: Regenerated.
* generated/matmul_i2.c: Regenerated.
* generated/matmul_i4.c: Regenerated.
* generated/matmul_i8.c: Regenerated.
* generated/matmul_r10.c: Regenerated.
* generated/matmul_r16.c: Regenerated.
* generated/matmul_r4.c: Regenerated.
* generated/matmul_r8.c: Regenerated.
Index: generated/matmul_c10.c
===
--- generated/matmul_c10.c	(Revision 242477)
+++ generated/matmul_c10.c	(Arbeitskopie)
@@ -75,11 +75,37 @@ extern void matmul_c10 (gfc_array_c10 * const rest
 	int blas_limit, blas_call gemm);
 export_proto(matmul_c10);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_c10 (gfc_array_c10 * const restrict retarray, 
+	gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_c10 (gfc_array_c10 * const restrict retarray, 
 	gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_c10 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_c10 (gfc_array_c10 * const restrict retarray, 
+	gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_c10 (gfc_array_c10 * const restrict retarray, 
+	gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_COMPLEX_10 * restrict abase;
   const GFC_COMPLEX_10 * restrict bbase;
   GFC_COMPLEX_10 * restrict dest;
Index: generated/matmul_c16.c
===
--- generated/matmul_c16.c	(Revision 242477)
+++ generated/matmul_c16.c	(Arbeitskopie)
@@ -75,11 +75,37 @@ extern void matmul_c16 (gfc_array_c16 * const rest
 	int blas_limit, blas_call gemm);
 export_proto(matmul_c16);
 
+#ifdef __x86_64__
+
+/* For x86_64, we switch to AVX if that is available.  For this, we
+   let the actual work be done by the static aux_matmul - function.
+   The user-callable function will then automagically contain the
+   selection code for the right architecture.  This is done to avoid
+   knowledge of architecture details in the front end.  */
+
+static void aux_matmul_c16 (gfc_array_c16 * const restrict retarray, 
+	gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+	__attribute__ ((target_clones("avx,default")));
+
 void
 matmul_c16 (gfc_array_c16 * const restrict retarray, 
 	gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
 	int blas_limit, blas_call gemm)
 {
+  aux_matmul_c16 (retarray, a, b, try_blas, blas_limit, gemm);
+}
+
+static void
+aux_matmul_c16 (gfc_array_c16 * const restrict retarray, 
+	gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#else
+matmul_c16 (gfc_array_c16 * const restrict retarray, 
+	gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
+	int blas_limit, blas_call gemm)
+#endif
+{
   const GFC_COMPLEX_16 * restrict abase;
   const GFC_COMPLEX_16 * restrict bbase;
   GFC_COMPLEX_16 * restrict dest;
Index: generated/matmul_c4.c
===
--- generated/matmul_c4.c	(Revision 242477)
+++ generated/matmul_c4.c	(Arbeitskopie)
@@ -75,11 +75,37 @@ extern void matmul_c4 (gfc_array_c4 * const restri
 	int blas_limit, blas_call gemm);