Re: [i386] Replace builtins with vector extensions

2014-10-09 Thread Marc Glisse

Ping https://gcc.gnu.org/ml/gcc-patches/2014-07/msg01812.html

(another part of the discussion is around 
https://gcc.gnu.org/ml/gcc-patches/2014-06/msg02288.html )


Most people who commented seem cautiously in favor. The least favorable 
was Ulrich who suggested to go with it but keep the old behavior 
accessible if the user defines some macro (which imho would lose a large 
part of the simplification benefits of the patch)

https://gcc.gnu.org/ml/gcc-patches/2014-06/msg02328.html

If this is accepted, I will gladly prepare patches removing the unused 
builtins and extending this to a few more operations (integer vectors in 
particular). If this is not the direction we want to go, I'd like to hear 
it clearly so I can move on...


My main doubt with the current patch is whether it is better to write 
simply (both variables have type __m128d):

__A + __B
or, as we will have to do for integers:
(__m128d)((__v2df)__A + (__v2df)__B)

--
Marc Glisse


Re: [i386] Replace builtins with vector extensions

2014-10-09 Thread Uros Bizjak
On Thu, Oct 9, 2014 at 12:33 PM, Marc Glisse marc.gli...@inria.fr wrote:
 Ping https://gcc.gnu.org/ml/gcc-patches/2014-07/msg01812.html

 (another part of the discussion is around
 https://gcc.gnu.org/ml/gcc-patches/2014-06/msg02288.html )

 Most people who commented seem cautiously in favor. The least favorable was
 Ulrich who suggested to go with it but keep the old behavior accessible if
 the user defines some macro (which imho would lose a large part of the
 simplification benefits of the patch)
 https://gcc.gnu.org/ml/gcc-patches/2014-06/msg02328.html

 If this is accepted, I will gladly prepare patches removing the unused
 builtins and extending this to a few more operations (integer vectors in
 particular). If this is not the direction we want to go, I'd like to hear it
 clearly so I can move on...

Well, I'm undecided.

The current approach is proven to work OK, there is no bugs reported
in this area and the performance is apparently OK. There should be
clear benefits in order to change something that ain't broken, and
at least some proof that we won't regress in this area with the new
approach.

On the other hand, if the new approach opens new optimization
opportunities (without regression!), I'm in favor of it, including the
fact that new code won't produce equivalent assembly - as long as
functionality of the optimized asm stays the same (obviously, I'd
say).

Please also note that this is quite big project. There are plenty of
intrinsics and I for one don't want another partial transition ...

TL/DR: If there are benefits, no regressions and you think you'll
finish the transition, let's go for it.

Uros.


Re: [i386] Replace builtins with vector extensions

2014-10-09 Thread Marc Glisse

On Thu, 9 Oct 2014, Uros Bizjak wrote:


On Thu, Oct 9, 2014 at 12:33 PM, Marc Glisse marc.gli...@inria.fr wrote:

Ping https://gcc.gnu.org/ml/gcc-patches/2014-07/msg01812.html

(another part of the discussion is around
https://gcc.gnu.org/ml/gcc-patches/2014-06/msg02288.html )

Most people who commented seem cautiously in favor. The least favorable was
Ulrich who suggested to go with it but keep the old behavior accessible if
the user defines some macro (which imho would lose a large part of the
simplification benefits of the patch)
https://gcc.gnu.org/ml/gcc-patches/2014-06/msg02328.html

If this is accepted, I will gladly prepare patches removing the unused
builtins and extending this to a few more operations (integer vectors in
particular). If this is not the direction we want to go, I'd like to hear it
clearly so I can move on...


Well, I'm undecided.


First, thanks for answering, it helps me a lot to know what others think.


The current approach is proven to work OK, there is no bugs reported
in this area and the performance is apparently OK. There should be
clear benefits in order to change something that ain't broken, and
at least some proof that we won't regress in this area with the new
approach.


There are quite a few enhancement PRs asking for more performance, but 
indeed no (or very few) complaints about correctness or about gcc turning 
their code into something worse than what they wrote, which I completely 
agree weighs more.



On the other hand, if the new approach opens new optimization
opportunities (without regression!), I'm in favor of it, including the
fact that new code won't produce equivalent assembly - as long as
functionality of the optimized asm stays the same (obviously, I'd
say).

Please also note that this is quite big project. There are plenty of
intrinsics and I for one don't want another partial transition ...


That might be an issue : this transition is partial by nature. Many 
intrinsics cannot (easily) be expressed in GIMPLE, and among those that 
can be represented, we only want to change those for which we are 
confident that we will not regress the quality of the code. From the 
reactions, I would assume that we want to be quite conservative at the 
beginning, and maybe we can reconsider some other intrinsics later.


The best I can offer is consistency: if addition of v2df is changed, 
addition of v4df is changed as well (and say any +-*/ of float/double 
vectors of any supported size). Another block would be +-*/% for integer 
vectors. And construction / access (most construction is already 
builtin-free). And remove the unused builtins in the same patch that makes 
them unused. If you don't like those blocks, I can write one mega-patch 
that does all these, if we roughly agree on the list beforehand, so it 
goes in all at once.


Would that be good enough?


TL/DR: If there are benefits, no regressions and you think you'll
finish the transition, let's go for it.


--
Marc Glisse


Re: [i386] Replace builtins with vector extensions

2014-10-09 Thread Uros Bizjak
On Thu, Oct 9, 2014 at 2:28 PM, Marc Glisse marc.gli...@inria.fr wrote:
 On Thu, 9 Oct 2014, Uros Bizjak wrote:

 On Thu, Oct 9, 2014 at 12:33 PM, Marc Glisse marc.gli...@inria.fr wrote:

 Ping https://gcc.gnu.org/ml/gcc-patches/2014-07/msg01812.html

 (another part of the discussion is around
 https://gcc.gnu.org/ml/gcc-patches/2014-06/msg02288.html )

 Most people who commented seem cautiously in favor. The least favorable
 was
 Ulrich who suggested to go with it but keep the old behavior accessible
 if
 the user defines some macro (which imho would lose a large part of the
 simplification benefits of the patch)
 https://gcc.gnu.org/ml/gcc-patches/2014-06/msg02328.html

 If this is accepted, I will gladly prepare patches removing the unused
 builtins and extending this to a few more operations (integer vectors in
 particular). If this is not the direction we want to go, I'd like to hear
 it
 clearly so I can move on...


 Well, I'm undecided.


 First, thanks for answering, it helps me a lot to know what others think.

 The current approach is proven to work OK, there is no bugs reported
 in this area and the performance is apparently OK. There should be
 clear benefits in order to change something that ain't broken, and
 at least some proof that we won't regress in this area with the new
 approach.


 There are quite a few enhancement PRs asking for more performance, but
 indeed no (or very few) complaints about correctness or about gcc turning
 their code into something worse than what they wrote, which I completely
 agree weighs more.

 On the other hand, if the new approach opens new optimization
 opportunities (without regression!), I'm in favor of it, including the
 fact that new code won't produce equivalent assembly - as long as
 functionality of the optimized asm stays the same (obviously, I'd
 say).

 Please also note that this is quite big project. There are plenty of
 intrinsics and I for one don't want another partial transition ...


 That might be an issue : this transition is partial by nature. Many
 intrinsics cannot (easily) be expressed in GIMPLE, and among those that can
 be represented, we only want to change those for which we are confident that
 we will not regress the quality of the code. From the reactions, I would
 assume that we want to be quite conservative at the beginning, and maybe we
 can reconsider some other intrinsics later.

 The best I can offer is consistency: if addition of v2df is changed,
 addition of v4df is changed as well (and say any +-*/ of float/double
 vectors of any supported size). Another block would be +-*/% for integer
 vectors. And construction / access (most construction is already
 builtin-free). And remove the unused builtins in the same patch that makes
 them unused. If you don't like those blocks, I can write one mega-patch that
 does all these, if we roughly agree on the list beforehand, so it goes in
 all at once.

 Would that be good enough?

OK, let's go in the proposed way, more detailed:

- we begin with +-*/ of float/double vectors. IMO, this would result
in a relatively small and easily reviewable patch to iron out the
details of the approach. Alternatively, we can begin with floats only.
- commit the patch and wait for the sky to fall down.
- we play a bit with the compiler to check generated code and corner
cases (some kind of Q/A) and wait if someone finds a problem (say, a
couple of weeks).
- if there are no problems, continue with integer builtins following
the established approach, otherwise we revert everything and go back
to the drawing board.
- repeat the procedure for other builtins.

I propose to wait a couple of days for possible comments before we get
the ball rolling.

Uros.


Re: [i386] Replace builtins with vector extensions

2014-10-09 Thread Kirill Yukhin
Hello folks,
On 09 Oct 14:57, Uros Bizjak wrote:
 On Thu, Oct 9, 2014 at 2:28 PM, Marc Glisse marc.gli...@inria.fr wrote:
  On Thu, 9 Oct 2014, Uros Bizjak wrote:
 OK, let's go in the proposed way, more detailed:
 
 - we begin with +-*/ of float/double vectors. IMO, this would result
 in a relatively small and easily reviewable patch to iron out the
 details of the approach. Alternatively, we can begin with floats only.
 - commit the patch and wait for the sky to fall down.
 - we play a bit with the compiler to check generated code and corner
 cases (some kind of Q/A) and wait if someone finds a problem (say, a
 couple of weeks).
 - if there are no problems, continue with integer builtins following
 the established approach, otherwise we revert everything and go back
 to the drawing board.
 - repeat the procedure for other builtins.
 
 I propose to wait a couple of days for possible comments before we get
 the ball rolling.
Let me repeat, I think this is good idea to do.
I just wanted to kindly ask you wait for about 1-2ww before checking-in
this things.
I hope in that time AVX-512VL,BW,DQ will hit trunk completely 
and *lots* more intrinsics will be added (I think intrinsics is
subject of ~[85/n] patch).

--
Thanks, K

 
 Uros.


Re: [i386] Replace builtins with vector extensions

2014-10-09 Thread H.J. Lu
On Thu, Oct 9, 2014 at 5:57 AM, Uros Bizjak ubiz...@gmail.com wrote:
 On Thu, Oct 9, 2014 at 2:28 PM, Marc Glisse marc.gli...@inria.fr wrote:
 On Thu, 9 Oct 2014, Uros Bizjak wrote:

 On Thu, Oct 9, 2014 at 12:33 PM, Marc Glisse marc.gli...@inria.fr wrote:

 Ping https://gcc.gnu.org/ml/gcc-patches/2014-07/msg01812.html

 (another part of the discussion is around
 https://gcc.gnu.org/ml/gcc-patches/2014-06/msg02288.html )

 Most people who commented seem cautiously in favor. The least favorable
 was
 Ulrich who suggested to go with it but keep the old behavior accessible
 if
 the user defines some macro (which imho would lose a large part of the
 simplification benefits of the patch)
 https://gcc.gnu.org/ml/gcc-patches/2014-06/msg02328.html

 If this is accepted, I will gladly prepare patches removing the unused
 builtins and extending this to a few more operations (integer vectors in
 particular). If this is not the direction we want to go, I'd like to hear
 it
 clearly so I can move on...


 Well, I'm undecided.


 First, thanks for answering, it helps me a lot to know what others think.

 The current approach is proven to work OK, there is no bugs reported
 in this area and the performance is apparently OK. There should be
 clear benefits in order to change something that ain't broken, and
 at least some proof that we won't regress in this area with the new
 approach.


 There are quite a few enhancement PRs asking for more performance, but
 indeed no (or very few) complaints about correctness or about gcc turning
 their code into something worse than what they wrote, which I completely
 agree weighs more.

 On the other hand, if the new approach opens new optimization
 opportunities (without regression!), I'm in favor of it, including the
 fact that new code won't produce equivalent assembly - as long as
 functionality of the optimized asm stays the same (obviously, I'd
 say).

 Please also note that this is quite big project. There are plenty of
 intrinsics and I for one don't want another partial transition ...


 That might be an issue : this transition is partial by nature. Many
 intrinsics cannot (easily) be expressed in GIMPLE, and among those that can
 be represented, we only want to change those for which we are confident that
 we will not regress the quality of the code. From the reactions, I would
 assume that we want to be quite conservative at the beginning, and maybe we
 can reconsider some other intrinsics later.

 The best I can offer is consistency: if addition of v2df is changed,
 addition of v4df is changed as well (and say any +-*/ of float/double
 vectors of any supported size). Another block would be +-*/% for integer
 vectors. And construction / access (most construction is already
 builtin-free). And remove the unused builtins in the same patch that makes
 them unused. If you don't like those blocks, I can write one mega-patch that
 does all these, if we roughly agree on the list beforehand, so it goes in
 all at once.

 Would that be good enough?

 OK, let's go in the proposed way, more detailed:

 - we begin with +-*/ of float/double vectors. IMO, this would result
 in a relatively small and easily reviewable patch to iron out the
 details of the approach. Alternatively, we can begin with floats only.
 - commit the patch and wait for the sky to fall down.
 - we play a bit with the compiler to check generated code and corner
 cases (some kind of Q/A) and wait if someone finds a problem (say, a
 couple of weeks).
 - if there are no problems, continue with integer builtins following
 the established approach, otherwise we revert everything and go back
 to the drawing board.
 - repeat the procedure for other builtins.

 I propose to wait a couple of days for possible comments before we get
 the ball rolling.


We should also include some testcases to show code improvement
for each change.

Thanks.


-- 
H.J.


Re: [i386] Replace builtins with vector extensions

2014-10-09 Thread Olivier Hainque
Hello Marc,

On Oct 9, 2014, at 12:33 PM, Marc Glisse wrote:
 If this is accepted, I will gladly prepare patches removing the unused 
 builtins and extending this to a few more operations (integer vectors in 
 particular). If this is not the direction we want to go, I'd like to hear it 
 clearly so I can move on...

As we discussed offlist, removing all the builtins would be problematic for
Ada as they are the only medium allowing flexible access to vector instructions
(aside autovectorization) for users.

Today, the model is very simple: people who want to build on top of vector
operations just bind to the builtins they need and expose higher level
interfaces if they like, provided proper type definitions (see g-sse.ads for
example).

We could provide an Ada version of the standard APIs for example, as we do for
Altivec on powerpc, and we have offered this capability out of customer 
requests.

Without the builtins, we'd need to define syntax + semantics for vector
operations in the language. While this is an interesting perspective, we
don't have that today and this would be a fair amount of non-trivial work
I'm afraid, not something we can take on just like that.

Note that this doesn't mean that we need all the builtins to remain there.
Just at least one of those providing access to a given machine insn at some
point. We can implement various sorts of always_inline wrappers to perform
type conversions as needed, and builtins are understood as very low level
devices so changes in the interface aren't an issue. The real issue would be
if access to a given insn becomes impossible out of the removals.

Thanks!

With Kind Regards,

Olivier




Re: [i386] Replace builtins with vector extensions

2014-10-09 Thread Marc Glisse

On Thu, 9 Oct 2014, Olivier Hainque wrote:


On Oct 9, 2014, at 12:33 PM, Marc Glisse wrote:

If this is accepted, I will gladly prepare patches removing the unused builtins 
and extending this to a few more operations (integer vectors in particular). If 
this is not the direction we want to go, I'd like to hear it clearly so I can 
move on...


As we discussed offlist, removing all the builtins would be problematic for
Ada as they are the only medium allowing flexible access to vector instructions
(aside autovectorization) for users.

Today, the model is very simple: people who want to build on top of vector
operations just bind to the builtins they need and expose higher level
interfaces if they like, provided proper type definitions (see g-sse.ads for
example).


It is sad that this prevents us from removing the builtins, but I agree 
that we can't just drop ada+sse users like that. Well, less work for me if 
I don't have to remove the builtins, and my main motivation is 
optimization, even if I tried to sell the clean up to convince people.


Uros, is it still ok if I change the intrinsics without removing the 
builtins? (with testcases for HJ and not before Kirill says it is ok)



Without the builtins, we'd need to define syntax + semantics for vector
operations in the language. While this is an interesting perspective, we
don't have that today and this would be a fair amount of non-trivial work
I'm afraid, not something we can take on just like that.


I think it is an interesting possibility to keep in mind (maybe in a few 
years?). Basic support in the C front-end is surprisingly simple (C++ 
templates are a different story), and doesn't need to be duplicated for 
sse/altivec/neon... only the weird operations really need builtins.


Thanks for posting this,

--
Marc Glisse


Re: [i386] Replace builtins with vector extensions

2014-10-09 Thread Uros Bizjak
On Thu, Oct 9, 2014 at 7:46 PM, Marc Glisse marc.gli...@inria.fr wrote:

 If this is accepted, I will gladly prepare patches removing the unused
 builtins and extending this to a few more operations (integer vectors in
 particular). If this is not the direction we want to go, I'd like to hear it
 clearly so I can move on...


 As we discussed offlist, removing all the builtins would be problematic
 for
 Ada as they are the only medium allowing flexible access to vector
 instructions
 (aside autovectorization) for users.

 Today, the model is very simple: people who want to build on top of vector
 operations just bind to the builtins they need and expose higher level
 interfaces if they like, provided proper type definitions (see g-sse.ads
 for
 example).


 It is sad that this prevents us from removing the builtins, but I agree that
 we can't just drop ada+sse users like that. Well, less work for me if I
 don't have to remove the builtins, and my main motivation is optimization,
 even if I tried to sell the clean up to convince people.

 Uros, is it still ok if I change the intrinsics without removing the
 builtins? (with testcases for HJ and not before Kirill says it is ok)

Given that this will be a substantial work and considering the request
from Kirill, what do you think about separate development branch until
AVXn stuff is finished? This will give a couple of weeks and a
playground to finalize the approach for the conversion. Maybe even ada
can be tested there to not regress with the compatibility stuff.

Uros.


Re: [i386] Replace builtins with vector extensions

2014-10-09 Thread Marc Glisse

On Thu, 9 Oct 2014, Uros Bizjak wrote:


Given that this will be a substantial work and considering the request
from Kirill, what do you think about separate development branch until
AVXn stuff is finished? This will give a couple of weeks and a
playground to finalize the approach for the conversion. Maybe even ada
can be tested there to not regress with the compatibility stuff.


No problem. We can also wait until next stage1 if you believe the release 
of gcc-5 is too close.


--
Marc Glisse


Re: [i386] Replace builtins with vector extensions

2014-07-29 Thread Kirill Yukhin
Hello Marc,
On 26 Jul 19:34, Marc Glisse wrote:
 I did some AVX and AVX512F intrinsics, and it still passes the
 testsuite (on my old pre-AVX x86_64-linux-gnu).

I've performed testing of your patch using functional simulator of
AVX*. And see no regressions as well.

--
Thanks, K


Re: [i386] Replace builtins with vector extensions

2014-07-26 Thread Marc Glisse

On Tue, 8 Jul 2014, Kirill Yukhin wrote:


Hello Marc.
On 04 Jul 21:11, Marc Glisse wrote:

On Thu, 3 Jul 2014, Kirill Yukhin wrote:
like combining 2 shuffles unless the result is the identity. And
expanding shuffles that can be done in a single instruction works
well.

But I am happy not doing them yet. To be very specific, could you
list which intrinsics you would like to remove from the posted
patch?

I am not a x86 maintainer, however while such a replacements produce
correct semantics and probably enable optimizations, I support your patch.

Probably you could try such your approach on AVX2, AVX-512 whose intrinsics
are well covered by tests?


I did some AVX and AVX512F intrinsics, and it still passes the testsuite 
(on my old pre-AVX x86_64-linux-gnu).



2014-07-26  Marc Glisse  marc.gli...@inria.fr

* config/i386/xmmintrin.h (_mm_add_ps, _mm_sub_ps, _mm_mul_ps,
_mm_div_ps, _mm_store_ss, _mm_cvtss_f32): Use vector extensions
instead of builtins.
* config/i386/avxintrin.h (_mm256_add_pd, _mm256_add_ps,
_mm256_div_pd, _mm256_div_ps, _mm256_mul_pd, _mm256_mul_ps,
_mm256_sub_pd, _mm256_sub_ps): Likewise.
* config/i386/avx512fintrin.h (_mm512_add_pd, _mm512_add_ps,
_mm512_sub_pd, _mm512_sub_ps, _mm512_mul_pd, _mm512_mul_ps,
_mm512_div_pd, _mm512_div_ps): Likewise.
* config/i386/emmintrin.h (_mm_store_sd, _mm_cvtsd_f64, _mm_storeh_pd,
_mm_cvtsi128_si64, _mm_cvtsi128_si64x, _mm_add_pd, _mm_sub_pd,
_mm_mul_pd, _mm_div_pd, _mm_storel_epi64, _mm_movepi64_pi64,
_mm_loadh_pd, _mm_loadl_pd): Likewise.
(_mm_sqrt_sd): Fix comment.


--
Marc GlisseIndex: gcc/config/i386/avx512fintrin.h
===
--- gcc/config/i386/avx512fintrin.h (revision 213083)
+++ gcc/config/i386/avx512fintrin.h (working copy)
@@ -10598,26 +10598,21 @@ _mm512_maskz_sqrt_ps (__mmask16 __U, __m
 (__v16sf)
 _mm512_setzero_ps (),
 (__mmask16) __U,
 _MM_FROUND_CUR_DIRECTION);
 }
 
 extern __inline __m512d
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_add_pd (__m512d __A, __m512d __B)
 {
-  return (__m512d) __builtin_ia32_addpd512_mask ((__v8df) __A,
-(__v8df) __B,
-(__v8df)
-_mm512_undefined_pd (),
-(__mmask8) -1,
-_MM_FROUND_CUR_DIRECTION);
+  return __A + __B;
 }
 
 extern __inline __m512d
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_mask_add_pd (__m512d __W, __mmask8 __U, __m512d __A, __m512d __B)
 {
   return (__m512d) __builtin_ia32_addpd512_mask ((__v8df) __A,
 (__v8df) __B,
 (__v8df) __W,
 (__mmask8) __U,
@@ -10633,26 +10628,21 @@ _mm512_maskz_add_pd (__mmask8 __U, __m51
 (__v8df)
 _mm512_setzero_pd (),
 (__mmask8) __U,
 _MM_FROUND_CUR_DIRECTION);
 }
 
 extern __inline __m512
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_add_ps (__m512 __A, __m512 __B)
 {
-  return (__m512) __builtin_ia32_addps512_mask ((__v16sf) __A,
-   (__v16sf) __B,
-   (__v16sf)
-   _mm512_undefined_ps (),
-   (__mmask16) -1,
-   _MM_FROUND_CUR_DIRECTION);
+  return __A + __B;
 }
 
 extern __inline __m512
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_mask_add_ps (__m512 __W, __mmask16 __U, __m512 __A, __m512 __B)
 {
   return (__m512) __builtin_ia32_addps512_mask ((__v16sf) __A,
(__v16sf) __B,
(__v16sf) __W,
(__mmask16) __U,
@@ -10668,26 +10658,21 @@ _mm512_maskz_add_ps (__mmask16 __U, __m5
(__v16sf)
_mm512_setzero_ps (),
(__mmask16) __U,
_MM_FROUND_CUR_DIRECTION);
 }
 
 extern __inline __m512d
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 

Re: [i386] Replace builtins with vector extensions

2014-07-08 Thread Kirill Yukhin
Hello Marc.
On 04 Jul 21:11, Marc Glisse wrote:
 On Thu, 3 Jul 2014, Kirill Yukhin wrote:
 like combining 2 shuffles unless the result is the identity. And
 expanding shuffles that can be done in a single instruction works
 well.
 
 But I am happy not doing them yet. To be very specific, could you
 list which intrinsics you would like to remove from the posted
 patch?
I am not a x86 maintainer, however while such a replacements produce
correct semantics and probably enable optimizations, I support your patch.

Probably you could try such your approach on AVX2, AVX-512 whose intrinsics
are well covered by tests?

 On the over hand, updated in such a way intrinsic
 may actually generate different instruction then intended (e.g. FMA case).
 
 It is the same with scalars, we have -ffp-contract for that.
Agreed.

--
Thanks, K


Re: [i386] Replace builtins with vector extensions

2014-07-08 Thread Jakub Jelinek
On Tue, Jul 08, 2014 at 03:14:04PM +0400, Kirill Yukhin wrote:
  On the over hand, updated in such a way intrinsic
  may actually generate different instruction then intended (e.g. FMA case).
  
  It is the same with scalars, we have -ffp-contract for that.
 Agreed.

I don't think we actually always guarantee using the particular instructions
for the intrinsics even when they are implemented using builtins, at least
if they don't use UNSPECs, e.g. if combiner or peephole2 manage to combine
something into some other insn, we'll happily do that.

Jakub


Re: [i386] Replace builtins with vector extensions

2014-07-08 Thread Mike Stump
On Jul 8, 2014, at 4:17 AM, Jakub Jelinek ja...@redhat.com wrote:
 On Tue, Jul 08, 2014 at 03:14:04PM +0400, Kirill Yukhin wrote:
 On the over hand, updated in such a way intrinsic
 may actually generate different instruction then intended (e.g. FMA case).
 
 It is the same with scalars, we have -ffp-contract for that.
 Agreed.
 
 I don't think we actually always guarantee using the particular instructions
 for the intrinsics even when they are implemented using builtins, at least
 if they don't use UNSPECs, e.g. if combiner or peephole2 manage to combine
 something into some other insn, we'll happily do that.

In a testcase, one is free to hide the inputs and the output from the optimizer 
using standard tricks and take one step closer to having a 1-1 mapping.  Of 
course, wether or not the port even offers a 1-1 mapping for any particular 
builtin is completely dependent upon the port.

Re: [i386] Replace builtins with vector extensions

2014-07-04 Thread Marc Glisse

On Thu, 3 Jul 2014, Kirill Yukhin wrote:


Hello Marc,
On 28 Jun 12:42, Marc Glisse wrote:

It would enable a number of optimizations, like constant
propagation, FMA contraction, etc. It would also allow us to remove
several builtins.

This should be main motivation for replacing built-ins.
But this approach IMHO should only be used for `obvious' cases only.
I mean: + - / * and friends.
Think that this shouldn't apply for shuffles, broadcasts.
But we have to define border between `obvious' and rest intrinsics.


We don't have a syntax in the front-end for broadcasts anyway, but are you 
sure about shuffles? __builtin_shuffle directly translates to 
VEC_PERM_EXPR, on which we are careful to avoid optimizations like 
combining 2 shuffles unless the result is the identity. And expanding 
shuffles that can be done in a single instruction works well.


But I am happy not doing them yet. To be very specific, could you list 
which intrinsics you would like to remove from the posted patch?



On the over hand, updated in such a way intrinsic
may actually generate different instruction then intended (e.g. FMA case).


It is the same with scalars, we have -ffp-contract for that.


For ICC this is generally OK to generate different instructions, only
semantics should be obeyed.


--
Marc Glisse


Re: [i386] Replace builtins with vector extensions

2014-07-03 Thread Kirill Yukhin
Hello Marc,
On 28 Jun 12:42, Marc Glisse wrote:
 It would enable a number of optimizations, like constant
 propagation, FMA contraction, etc. It would also allow us to remove
 several builtins.
This should be main motivation for replacing built-ins.
But this approach IMHO should only be used for `obvious' cases only.
I mean: + - / * and friends.
Think that this shouldn't apply for shuffles, broadcasts.
But we have to define border between `obvious' and rest intrinsics.

On the over hand, updated in such a way intrinsic
may actually generate different instruction then intended (e.g. FMA case).

For ICC this is generally OK to generate different instructions, only
semantics should be obeyed.

--
Thanks, K


Re: [i386] Replace builtins with vector extensions

2014-06-29 Thread Ulrich Drepper
On Sat, Jun 28, 2014 at 6:53 PM, Marc Glisse marc.gli...@inria.fr wrote:
 There is always a risk, but then even with builtins I think there was a
 small risk that an RTL optimization would mess things up. It is indeed
 higher if we expose the operation to the optimizers earlier, but it would be
 a bug if an optimization replaced a vector operation by something worse.
 Also, I am only proposing to handle the most trivial operations this way,
 not more complicated ones (like v[0]+=s) where we would be likely to fail
 generating the right instruction. And the pragma should ensure that the
 function will always be compiled in a mode where the vector instruction is
 available.

 ARM did the same and I don't think I have seen a bug reporting a regression
 about it (I haven't really looked though).

I think the Arm definitions come from a different angle.  It's new,
there is no assumed semantics.  For the x86 intrinsics Intel defines
that _mm_xxx() generates one of a given opcodes if there is a match.
If I want to generate a specific code sequence I use the intrinsics.
Otherwise I could already today use the vector type semantics myself.

Don't get me wrong, I like the idea to have the optimization of the
intrinsics happening.  But perhaps not unconditionally or at least not
without preventing them.

I know this will look ugly, but how about a macro
__GCC_X86_HONOR_INTRINSICS to enable the current code and have by
default your proposed use of the vector arithmetic in place?  This
wouldn't allow removing support for the built-ins but it would also
open the door to some more risky optimizations to be enabled by
default.


Re: [i386] Replace builtins with vector extensions

2014-06-29 Thread Marc Glisse

On Sun, 29 Jun 2014, Ulrich Drepper wrote:


I think the Arm definitions come from a different angle.  It's new,
there is no assumed semantics.


Is it that new? I thought it was implemented based on a rather precise 
specification by ARM. Again, I don't really know arm.



For the x86 intrinsics Intel defines
that _mm_xxx() generates one of a given opcodes if there is a match.
If I want to generate a specific code sequence I use the intrinsics.


We already sometimes generate a different instruction than the name of the 
instrinsic suggests, or combine consecutive intrinsics into something 
else. I use inline asm when I want a specific code sequence.



Otherwise I could already today use the vector type semantics myself.


Well, the main reasons I use the intrinsics are:
1) the code compiles with visual studio
2) use the esoteric instructions (anything without a trivial mapping in C)


Don't get me wrong, I like the idea to have the optimization of the
intrinsics happening.  But perhaps not unconditionally or at least not
without preventing them.

I know this will look ugly, but how about a macro
__GCC_X86_HONOR_INTRINSICS to enable the current code and have by
default your proposed use of the vector arithmetic in place?  This
wouldn't allow removing support for the built-ins but it would also
open the door to some more risky optimizations to be enabled by
default.


That's a pretty big drawback. Instead of simplifying the implementation, 
it makes it more complicated. We also have to document the macro, update 
the testsuite so it tests the intrinsics in both modes, etc.


I understand the concern, and I would probably implement 
__GCC_X86_HONOR_INTRINSICS (though the testsuite part scares me as I have 
so little understanding of how it works, so I may need help), but I'd like 
to make sure first that the simpler approach is not acceptable, possibly 
with strong constraints on which operations are ok (_mm_load[hl]_pd could 
be removed from the patch for instance).


As another comparison, clang's version of *intrin.h uses the vector 
extensions much more than I am proposing.


--
Marc Glisse


Re: [i386] Replace builtins with vector extensions

2014-06-28 Thread Marc Glisse

Ping,

nobody has an opinion on this? Or some explanation why I am mistaken to 
believe that #pragma target makes it safer now?


It would enable a number of optimizations, like constant propagation, FMA 
contraction, etc. It would also allow us to remove several builtins.


On Sat, 17 May 2014, Marc Glisse wrote:


Ping

On Mon, 28 Apr 2014, Marc Glisse wrote:


Ping
http://gcc.gnu.org/ml/gcc-patches/2014-04/msg00590.html

(note that ARM seems to be doing the same thing for their neon intrinsics, 
see Ramana's patch series posted today)


On Fri, 11 Apr 2014, Marc Glisse wrote:


Hello,

the previous discussion on the topic was before we added all those #pragma 
target in *mmintrin.h:


http://gcc.gnu.org/ml/gcc-patches/2013-04/msg00374.html

I believe that removes a large part of the arguments against it. Note that 
I only did a few of the more obvious intrinsics, I am waiting to see if 
this patch is accepted before doing more.


Bootstrap+testsuite on x86_64-linux-gnu.

2014-04-11  Marc Glisse  marc.gli...@inria.fr

* config/i386/xmmintrin.h (_mm_add_ps, _mm_sub_ps, _mm_mul_ps,
_mm_div_ps, _mm_store_ss, _mm_cvtss_f32): Use vector extensions
instead of builtins.
* config/i386/emmintrin.h (_mm_store_sd, _mm_cvtsd_f64, _mm_storeh_pd,
_mm_cvtsi128_si64, _mm_cvtsi128_si64x, _mm_add_pd, _mm_sub_pd,
_mm_mul_pd, _mm_div_pd, _mm_storel_epi64, _mm_movepi64_pi64,
_mm_loadh_pd, _mm_loadl_pd): Likewise.
(_mm_sqrt_sd): Fix comment.


--
Marc Glisse


Re: [i386] Replace builtins with vector extensions

2014-06-28 Thread Marc Glisse

On Sat, 28 Jun 2014, Ulrich Drepper wrote:


On Sat, Jun 28, 2014 at 6:42 AM, Marc Glisse marc.gli...@inria.fr wrote:

Ping,

nobody has an opinion on this? Or some explanation why I am mistaken to
believe that #pragma target makes it safer now?

It would enable a number of optimizations, like constant propagation, FMA
contraction, etc. It would also allow us to remove several builtins.


I see no problem with using the array-type access to the registers.

As for replacing the builtins with arithmetic operators: I appreciate
the possibility for optimization.  But is there any chance the calls
could not end up being implemented with a vector instruction?  I think
that would be bad.  The intrinsics should be a way to guarantee that
the programmer can create vector instructions.  Otherwise we might
just not support them.


There is always a risk, but then even with builtins I think there was a 
small risk that an RTL optimization would mess things up. It is indeed 
higher if we expose the operation to the optimizers earlier, but it would 
be a bug if an optimization replaced a vector operation by something 
worse. Also, I am only proposing to handle the most trivial operations 
this way, not more complicated ones (like v[0]+=s) where we would be 
likely to fail generating the right instruction. And the pragma should 
ensure that the function will always be compiled in a mode where the 
vector instruction is available.


ARM did the same and I don't think I have seen a bug reporting a 
regression about it (I haven't really looked though).


Thanks,

--
Marc Glisse


Re: [i386] Replace builtins with vector extensions

2014-05-17 Thread Marc Glisse

Ping

On Mon, 28 Apr 2014, Marc Glisse wrote:


Ping
http://gcc.gnu.org/ml/gcc-patches/2014-04/msg00590.html

(note that ARM seems to be doing the same thing for their neon intrinsics, 
see Ramana's patch series posted today)


On Fri, 11 Apr 2014, Marc Glisse wrote:


Hello,

the previous discussion on the topic was before we added all those #pragma 
target in *mmintrin.h:


http://gcc.gnu.org/ml/gcc-patches/2013-04/msg00374.html

I believe that removes a large part of the arguments against it. Note that 
I only did a few of the more obvious intrinsics, I am waiting to see if 
this patch is accepted before doing more.


Bootstrap+testsuite on x86_64-linux-gnu.

2014-04-11  Marc Glisse  marc.gli...@inria.fr

* config/i386/xmmintrin.h (_mm_add_ps, _mm_sub_ps, _mm_mul_ps,
_mm_div_ps, _mm_store_ss, _mm_cvtss_f32): Use vector extensions
instead of builtins.
	* config/i386/emmintrin.h (_mm_store_sd, _mm_cvtsd_f64, 
_mm_storeh_pd,

_mm_cvtsi128_si64, _mm_cvtsi128_si64x, _mm_add_pd, _mm_sub_pd,
_mm_mul_pd, _mm_div_pd, _mm_storel_epi64, _mm_movepi64_pi64,
_mm_loadh_pd, _mm_loadl_pd): Likewise.
(_mm_sqrt_sd): Fix comment.


--
Marc Glisse


Re: [i386] Replace builtins with vector extensions

2014-04-28 Thread Marc Glisse

Ping
http://gcc.gnu.org/ml/gcc-patches/2014-04/msg00590.html

(note that ARM seems to be doing the same thing for their neon 
intrinsics, see Ramana's patch series posted today)


On Fri, 11 Apr 2014, Marc Glisse wrote:


Hello,

the previous discussion on the topic was before we added all those #pragma 
target in *mmintrin.h:


http://gcc.gnu.org/ml/gcc-patches/2013-04/msg00374.html

I believe that removes a large part of the arguments against it. Note that I 
only did a few of the more obvious intrinsics, I am waiting to see if this 
patch is accepted before doing more.


Bootstrap+testsuite on x86_64-linux-gnu.

2014-04-11  Marc Glisse  marc.gli...@inria.fr

* config/i386/xmmintrin.h (_mm_add_ps, _mm_sub_ps, _mm_mul_ps,
_mm_div_ps, _mm_store_ss, _mm_cvtss_f32): Use vector extensions
instead of builtins.
	* config/i386/emmintrin.h (_mm_store_sd, _mm_cvtsd_f64, 
_mm_storeh_pd,

_mm_cvtsi128_si64, _mm_cvtsi128_si64x, _mm_add_pd, _mm_sub_pd,
_mm_mul_pd, _mm_div_pd, _mm_storel_epi64, _mm_movepi64_pi64,
_mm_loadh_pd, _mm_loadl_pd): Likewise.
(_mm_sqrt_sd): Fix comment.


--
Marc Glisse


[i386] Replace builtins with vector extensions

2014-04-11 Thread Marc Glisse

Hello,

the previous discussion on the topic was before we added all those #pragma 
target in *mmintrin.h:


http://gcc.gnu.org/ml/gcc-patches/2013-04/msg00374.html

I believe that removes a large part of the arguments against it. Note that 
I only did a few of the more obvious intrinsics, I am waiting to see if 
this patch is accepted before doing more.


Bootstrap+testsuite on x86_64-linux-gnu.

2014-04-11  Marc Glisse  marc.gli...@inria.fr

* config/i386/xmmintrin.h (_mm_add_ps, _mm_sub_ps, _mm_mul_ps,
_mm_div_ps, _mm_store_ss, _mm_cvtss_f32): Use vector extensions
instead of builtins.
* config/i386/emmintrin.h (_mm_store_sd, _mm_cvtsd_f64, _mm_storeh_pd,
_mm_cvtsi128_si64, _mm_cvtsi128_si64x, _mm_add_pd, _mm_sub_pd,
_mm_mul_pd, _mm_div_pd, _mm_storel_epi64, _mm_movepi64_pi64,
_mm_loadh_pd, _mm_loadl_pd): Likewise.
(_mm_sqrt_sd): Fix comment.

--
Marc GlisseIndex: gcc/config/i386/emmintrin.h
===
--- gcc/config/i386/emmintrin.h (revision 209323)
+++ gcc/config/i386/emmintrin.h (working copy)
@@ -161,40 +161,40 @@ _mm_store_pd (double *__P, __m128d __A)
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_storeu_pd (double *__P, __m128d __A)
 {
   __builtin_ia32_storeupd (__P, __A);
 }
 
 /* Stores the lower DPFP value.  */
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_store_sd (double *__P, __m128d __A)
 {
-  *__P = __builtin_ia32_vec_ext_v2df (__A, 0);
+  *__P = __A[0];
 }
 
 extern __inline double __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_cvtsd_f64 (__m128d __A)
 {
-  return __builtin_ia32_vec_ext_v2df (__A, 0);
+  return __A[0];
 }
 
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_storel_pd (double *__P, __m128d __A)
 {
   _mm_store_sd (__P, __A);
 }
 
 /* Stores the upper DPFP value.  */
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_storeh_pd (double *__P, __m128d __A)
 {
-  *__P = __builtin_ia32_vec_ext_v2df (__A, 1);
+  *__P = __A[1];
 }
 
 /* Store the lower DPFP value across two words.
The address must be 16-byte aligned.  */
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_store1_pd (double *__P, __m128d __A)
 {
   _mm_store_pd (__P, __builtin_ia32_shufpd (__A, __A, _MM_SHUFFLE2 (0,0)));
 }
 
@@ -215,86 +215,86 @@ extern __inline int __attribute__((__gnu
 _mm_cvtsi128_si32 (__m128i __A)
 {
   return __builtin_ia32_vec_ext_v4si ((__v4si)__A, 0);
 }
 
 #ifdef __x86_64__
 /* Intel intrinsic.  */
 extern __inline long long __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_cvtsi128_si64 (__m128i __A)
 {
-  return __builtin_ia32_vec_ext_v2di ((__v2di)__A, 0);
+  return __A[0];
 }
 
 /* Microsoft intrinsic.  */
 extern __inline long long __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_cvtsi128_si64x (__m128i __A)
 {
-  return __builtin_ia32_vec_ext_v2di ((__v2di)__A, 0);
+  return __A[0];
 }
 #endif
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_add_pd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_addpd ((__v2df)__A, (__v2df)__B);
+  return __A + __B;
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_add_sd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_addsd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_sub_pd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_subpd ((__v2df)__A, (__v2df)__B);
+  return __A - __B;
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_sub_sd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_subsd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_mul_pd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_mulpd ((__v2df)__A, (__v2df)__B);
+  return __A * __B;
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_mul_sd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_mulsd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_div_pd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_divpd ((__v2df)__A, (__v2df)__B);
+  return __A / __B;
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_div_sd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_divsd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d 

Re: [i386] Replace builtins with vector extensions

2014-02-05 Thread Marc Glisse

Hello,

I was wondering if the new #pragma target in *mmintrin.h make this 
approach more acceptable for 4.10?


http://gcc.gnu.org/ml/gcc-patches/2013-04/msg00374.html

On Sun, 7 Apr 2013, Marc Glisse wrote:


Hello,

the attached patch is very incomplete (it passes bootstrap+testsuite on 
x86_64-linux-gnu), but it raises a number of questions that I'd like to 
settle before continuing.


* Is there any chance of a patch in this direction being accepted?

* May I remove the builtins (from i386.c and the doc) when they become 
unused?


* Do we want to keep the casts even when they don't seem strictly necessary? 
For instance for _mm_add_ps, we can write:

return __A + __B;
or:
return (__m128) ((__v4sf)__A + (__v4sf)__B);
Note that for _mm_add_epi8 for instance we do need the casts.

* For integer operations like _mm_add_epi16 I should probably use the 
unsigned typedefs to make it clear overflow is well defined? (the patch still 
has the signed version)


* Any better name than __v4su for the unsigned version of __v4si?

* Other comments?


2013-04-07  Marc Glisse  marc.gli...@inria.fr

* emmintrin.h (__v2du, __v4su, __v8hu): New typedefs.
(_mm_add_pd, _mm_sub_pd, _mm_mul_pd, _mm_div_pd,
_mm_cmpeq_pd, _mm_cmplt_pd, _mm_cmple_pd, _mm_cmpgt_pd, _mm_cmpge_pd,
_mm_cmpneq_pd, _mm_add_epi8, _mm_add_epi16, _mm_add_epi32,
_mm_add_epi64, _mm_slli_epi16, _mm_slli_epi32, _mm_slli_epi64,
_mm_srai_epi16, _mm_srai_epi32, _mm_srli_epi16, _mm_srli_epi32,
_mm_srli_epi64): Replace builtins with vector extensions.
* xmmintrin.h (_mm_add_ps, _mm_sub_ps, _mm_mul_ps, _mm_div_ps,
_mm_cmpeq_ps, _mm_cmplt_ps, _mm_cmple_ps, _mm_cmpgt_ps, _mm_cmpge_ps,
_mm_cmpneq_ps): Likewise.


--
Marc Glisse


Re: [i386] Replace builtins with vector extensions

2013-04-10 Thread Richard Biener
On Tue, Apr 9, 2013 at 9:15 PM, Marc Glisse marc.gli...@inria.fr wrote:
 On Tue, 9 Apr 2013, Marc Glisse wrote:

 On Tue, 9 Apr 2013, Richard Biener wrote:

 I seem to remember discussion in the PR(s) that the intrinsics should
 (and do for other compilers) expand to the desired instructions even when
 the corresponding instruction set is disabled.


 emmintrin.h starts with:
 #ifndef __SSE2__
 # error SSE2 instruction set not enabled


 Oh, re-reading your post, it looks like you mean we should change the
 current behavior, not just avoid regressions...

 My opinion on the intrinsics is that they are the portable way to use
 vectors on x86, but they are not equivalent to asm (which people should use
 if they don't want the compiler looking at their code). Knowingly generating
 SSE code with -mno-sse is not very appealing.

 However, the arguments in:
 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56298
 make sense. I guess I'll forget about this patch.

Note that to fully support emitting intrinsics correctly even without -msse
x86 specific builtins need to be used and they need to conditionally expand
to either UNSPECs (if the required instriuction set / modes are not available)
or regular RTL (where they can be folded to generic GIMPLE earlier
then as well).
A complication is register allocation which would need to understand how to
allocate registers for the UNSPECs - even if some of the modes would not
be available.  So it's indeed a mess ...

That said, folding of the x86 builtins to GIMPLE looks like a more
viable approach
that would not interfere too much with any possible route we would go here.
As suggested previously please add a new target hook with the same interface
as fold_stmt in case you want to work on this.

Thanks,
Richard.

 --
 Marc Glisse


Re: [i386] Replace builtins with vector extensions

2013-04-09 Thread Richard Biener
On Mon, Apr 8, 2013 at 10:47 PM, Marc Glisse marc.gli...@inria.fr wrote:
 On Sun, 7 Apr 2013, Marc Glisse wrote:

  extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__,
 __artificial__))
  _mm_slli_epi16 (__m128i __A, int __B)
  {
 -  return (__m128i)__builtin_ia32_psllwi128 ((__v8hi)__A, __B);
 +  return (__m128i) ((__v8hi)__A  __B);
  }


 Actually, I believe I have to keep using the builtins for shifts, because
 the intrinsics have well defined behavior for large __B whereas  and 
 don't.

I seem to remember discussion in the PR(s) that the intrinsics should
(and do for other compilers) expand to the desired instructions even when
the corresponding instruction set is disabled.  Using vector extension
makes that harder to achieve.  Other than that I am all for using the
vector extensions, but I think you need carefully wrapped __extension__
markers so that with -std=c89 -pedantic you still can compile programs
using the intrinsics?

Richard.

 --
 Marc Glisse


Re: [i386] Replace builtins with vector extensions

2013-04-09 Thread Marc Glisse

On Tue, 9 Apr 2013, Richard Biener wrote:


On Mon, Apr 8, 2013 at 10:47 PM, Marc Glisse marc.gli...@inria.fr wrote:

On Sun, 7 Apr 2013, Marc Glisse wrote:


 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__,
__artificial__))
 _mm_slli_epi16 (__m128i __A, int __B)
 {
-  return (__m128i)__builtin_ia32_psllwi128 ((__v8hi)__A, __B);
+  return (__m128i) ((__v8hi)__A  __B);
 }



Actually, I believe I have to keep using the builtins for shifts, because
the intrinsics have well defined behavior for large __B whereas  and 
don't.


I seem to remember discussion in the PR(s) that the intrinsics should
(and do for other compilers) expand to the desired instructions even when
the corresponding instruction set is disabled.


emmintrin.h starts with:
#ifndef __SSE2__
# error SSE2 instruction set not enabled

The closest thing I can think of is issues with -mfpmath=387, but that 
shouldn't matter for full vector ops.



Using vector extension
makes that harder to achieve.  Other than that I am all for using the
vector extensions, but I think you need carefully wrapped __extension__
markers so that with -std=c89 -pedantic you still can compile programs
using the intrinsics?


The *intrin.h files already use __extension__ to create vectors, like:
  return __extension__ (__m128d){ __F, 0.0 };
but even when I remove it it does not warn with -std=c89 -pedantic.

--
Marc Glisse


Re: [i386] Replace builtins with vector extensions

2013-04-09 Thread Jakub Jelinek
On Tue, Apr 09, 2013 at 11:08:38AM +0200, Marc Glisse wrote:
 The *intrin.h files already use __extension__ to create vectors, like:
   return __extension__ (__m128d){ __F, 0.0 };
 but even when I remove it it does not warn with -std=c89 -pedantic.

Even with -Wsystem-headers ?

Jakub


Re: [i386] Replace builtins with vector extensions

2013-04-09 Thread Marc Glisse

On Tue, 9 Apr 2013, Jakub Jelinek wrote:


On Tue, Apr 09, 2013 at 11:08:38AM +0200, Marc Glisse wrote:

The *intrin.h files already use __extension__ to create vectors, like:
  return __extension__ (__m128d){ __F, 0.0 };
but even when I remove it it does not warn with -std=c89 -pedantic.


Even with -Wsystem-headers ?


Oups ;-)

Ok, removing the existing __extension__ causes warnings (note that it can 
easily be worked around by initializing a variable instead of this 
compound literal, so it isn't vectors that pedantic complains about), but 
my changes do not warn.


--
Marc Glisse


Re: [i386] Replace builtins with vector extensions

2013-04-09 Thread Marc Glisse

On Tue, 9 Apr 2013, Marc Glisse wrote:


On Tue, 9 Apr 2013, Richard Biener wrote:


I seem to remember discussion in the PR(s) that the intrinsics should
(and do for other compilers) expand to the desired instructions even when
the corresponding instruction set is disabled.


emmintrin.h starts with:
#ifndef __SSE2__
# error SSE2 instruction set not enabled


Oh, re-reading your post, it looks like you mean we should change the 
current behavior, not just avoid regressions...


My opinion on the intrinsics is that they are the portable way to use 
vectors on x86, but they are not equivalent to asm (which people should 
use if they don't want the compiler looking at their code). Knowingly 
generating SSE code with -mno-sse is not very appealing.


However, the arguments in:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56298
make sense. I guess I'll forget about this patch.

--
Marc Glisse


Re: [i386] Replace builtins with vector extensions

2013-04-08 Thread Marc Glisse

On Sun, 7 Apr 2013, Marc Glisse wrote:


 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__,
__artificial__))
 _mm_slli_epi16 (__m128i __A, int __B)
 {
-  return (__m128i)__builtin_ia32_psllwi128 ((__v8hi)__A, __B);
+  return (__m128i) ((__v8hi)__A  __B);
 }


Actually, I believe I have to keep using the builtins for shifts, because 
the intrinsics have well defined behavior for large __B whereas  and  
don't.


--
Marc Glisse


[i386] Replace builtins with vector extensions

2013-04-07 Thread Marc Glisse

Hello,

the attached patch is very incomplete (it passes bootstrap+testsuite on 
x86_64-linux-gnu), but it raises a number of questions that I'd like to 
settle before continuing.


* Is there any chance of a patch in this direction being accepted?

* May I remove the builtins (from i386.c and the doc) when they become unused?

* Do we want to keep the casts even when they don't seem strictly 
necessary? For instance for _mm_add_ps, we can write:

return __A + __B;
or:
return (__m128) ((__v4sf)__A + (__v4sf)__B);
Note that for _mm_add_epi8 for instance we do need the casts.

* For integer operations like _mm_add_epi16 I should probably use the 
unsigned typedefs to make it clear overflow is well defined? (the patch 
still has the signed version)


* Any better name than __v4su for the unsigned version of __v4si?

* Other comments?


2013-04-07  Marc Glisse  marc.gli...@inria.fr

* emmintrin.h (__v2du, __v4su, __v8hu): New typedefs.
(_mm_add_pd, _mm_sub_pd, _mm_mul_pd, _mm_div_pd,
_mm_cmpeq_pd, _mm_cmplt_pd, _mm_cmple_pd, _mm_cmpgt_pd, _mm_cmpge_pd,
_mm_cmpneq_pd, _mm_add_epi8, _mm_add_epi16, _mm_add_epi32,
_mm_add_epi64, _mm_slli_epi16, _mm_slli_epi32, _mm_slli_epi64,
_mm_srai_epi16, _mm_srai_epi32, _mm_srli_epi16, _mm_srli_epi32,
_mm_srli_epi64): Replace builtins with vector extensions.
* xmmintrin.h (_mm_add_ps, _mm_sub_ps, _mm_mul_ps, _mm_div_ps,
_mm_cmpeq_ps, _mm_cmplt_ps, _mm_cmple_ps, _mm_cmpgt_ps, _mm_cmpge_ps,
_mm_cmpneq_ps): Likewise.


--
Marc GlisseIndex: config/i386/xmmintrin.h
===
--- config/i386/xmmintrin.h (revision 197549)
+++ config/i386/xmmintrin.h (working copy)
@@ -147,39 +147,39 @@ extern __inline __m128 __attribute__((__
 _mm_max_ss (__m128 __A, __m128 __B)
 {
   return (__m128) __builtin_ia32_maxss ((__v4sf)__A, (__v4sf)__B);
 }
 
 /* Perform the respective operation on the four SPFP values in A and B.  */
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_add_ps (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_addps ((__v4sf)__A, (__v4sf)__B);
+  return __A + __B;
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_sub_ps (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_subps ((__v4sf)__A, (__v4sf)__B);
+  return __A - __B;
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_mul_ps (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_mulps ((__v4sf)__A, (__v4sf)__B);
+  return __A * __B;
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_div_ps (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_divps ((__v4sf)__A, (__v4sf)__B);
+  return __A / __B;
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_sqrt_ps (__m128 __A)
 {
   return (__m128) __builtin_ia32_sqrtps ((__v4sf)__A);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_rcp_ps (__m128 __A)
@@ -323,51 +323,51 @@ _mm_cmpunord_ss (__m128 __A, __m128 __B)
   return (__m128) __builtin_ia32_cmpunordss ((__v4sf)__A, (__v4sf)__B);
 }
 
 /* Perform a comparison on the four SPFP values of A and B.  For each
element, if the comparison is true, place a mask of all ones in the
result, otherwise a mask of zeros.  */
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_cmpeq_ps (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_cmpeqps ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) (__A == __B);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_cmplt_ps (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_cmpltps ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) (__A  __B);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_cmple_ps (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_cmpleps ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) (__A = __B);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_cmpgt_ps (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_cmpgtps ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) (__A  __B);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_cmpge_ps (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_cmpgeps ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) (__A = __B);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, 
__artificial__))
 _mm_cmpneq_ps (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_cmpneqps ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) (__A 

Re: [i386] Replace builtins with vector extensions

2013-04-07 Thread Marc Glisse
By the way, the comment in emmintrin.h in front of _mm_sqrt_sd seems 
wrong:


/* Return pair {sqrt (A[0), B[1]}.  */

It should be instead:

/* Return pair {sqrt (B[0]), A[1]}.  */

If you agree I'll fix that independently.

--
Marc Glisse