Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
in the example below (see https://godbolt.org/z/qnfT4fE5G )
convert and covert3 produce code that looks to me inefficient w/r/t
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114484
--- Comment #9 from vincenzo Innocente ---
We observe that including xmmintrin.h changes the behaviour of some code,
notably abs(x) when x is float or double.
And this depends on the platform, as xmmintrin.h is x86_64 specific.
Yes, is 20
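To illustrate why the set of visible abs overloads matters, here is a minimal sketch that is not taken from the bug report. `c_only::abs` is a hypothetical stand-in for the situation where only the C-style `int abs(int)` is visible, and the function names `abs_via_int` and `abs_via_std` are assumptions made for illustration.

```cpp
#include <cmath>  // std::abs overloads for float and double

// Hypothetical stand-in for the case where only int abs(int) is visible.
namespace c_only {
inline int abs(int v) { return v < 0 ? -v : v; }
}

// The float argument is implicitly converted to int, so the fraction is lost.
inline float abs_via_int(float x) {
    return static_cast<float>(c_only::abs(x));
}

// With the std overloads visible, the float overload is selected.
inline float abs_via_std(float x) {
    return std::abs(x);
}
```

Under these assumptions, `abs_via_int(-2.5f)` yields 2.0f while `abs_via_std(-2.5f)` yields 2.5f, which is the kind of silent change a `using std::abs;` pulled in by a header can cause.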
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114484
--- Comment #4 from vincenzo Innocente ---
in C++ one is supposed to #include
not
I do not think that there is an explicit version of C++ headers for the
intrinsics that avoids the conflicts between C and C++.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114484
--- Comment #2 from vincenzo Innocente ---
*** Bug 114483 has been marked as a duplicate of this bug. ***
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114483
vincenzo Innocente changed:
What|Removed |Added
Status|UNCONFIRMED |RESOLVED
Resolution|---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114484
--- Comment #1 from vincenzo Innocente ---
xmmintrin.h
includes mm_malloc.h
which
#include
which
using std::abs;
(among others)
see
https://godbolt.org/z/cxo65rnr9
or this excerpt from c++ -E dump
```
# 32
```
++
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114363
--- Comment #4 from vincenzo Innocente ---
Thanks Harald, I missed the point that float z = pow(double(x),2) and
float z = x*x would indeed produce exactly the same result, while in all other
cases of course not.
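A small sketch of the point made in this comment, assuming IEEE 754 float and double without excess precision, and assuming the math library's pow returns the exact value when it is representable. The product of two 24-bit significands fits in the 53-bit significand of a double, so the double square is exact and is rounded only once when converted back to float. The function names are hypothetical.

```cpp
#include <cmath>

// Square computed entirely in float (one rounding).
inline float square_float(float x) {
    return x * x;
}

// Square computed in double via pow, then rounded back to float.
// Assumption: pow(double, 2) returns the exactly representable product.
inline float square_via_double_pow(float x) {
    return static_cast<float>(std::pow(static_cast<double>(x), 2));
}

// True when both routes give bit-for-bit the same float value.
inline bool squares_agree(float x) {
    return square_float(x) == square_via_double_pow(x);
}
```

The same argument does not carry over to sums such as pow(x,2)+pow(y,2), where the addition in double and the addition in float can round differently.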
: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
while pow(x,2) is optimized to x*x (float x)
in pow(x,2)+pow(y,2) x and y are first promoted to double
which I find inconsistent
see
https
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
Created attachment 56657
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56657=e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112348
--- Comment #1 from vincenzo Innocente ---
This patch works for me
diff --git a/libstdc++-v3/include/std/stacktrace
b/libstdc++-v3/include/std/stacktrace
index da0e48d3532..9a0d0b16068 100644
--- a/libstdc++-v3/include/std/stacktrace
+++
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263
--- Comment #12 from vincenzo Innocente ---
confirm that the patch solves the issue
c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINLIB -fpic -shared -o
liba.so -ldl;c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINMAIN -L. -la
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263
--- Comment #8 from vincenzo Innocente ---
Thanks Ian for the patch.
For testing I will need the full git diff (including the makefile itself as my
autoconf is not compatible with gcc14).
Backports down to gcc12 will be appreciated.
Could you
P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
gcc version 14.0.0 20231028 (experimental) [master r14-4988-g5d2a360f0a5] (GCC)
auto k = std::hash()(std::stacktrace::current());
does not compile to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263
--- Comment #6 from vincenzo Innocente ---
Sorry, made the (almost) full exercise:
read the doc in
https://en.cppreference.com/w/cpp/utility/stacktrace_entry
and the code in stacktrace header file and in
libstdc++-v3/src/c++23/stacktrace.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263
--- Comment #5 from vincenzo Innocente ---
so if I add to
std::cout << std::stacktrace::current() << '\n';
I get what needed
Dl_info dlinfo;
for (auto & entry : std::stacktrace::current() ) {
dladdr((const
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263
--- Comment #4 from vincenzo Innocente ---
intel x86_64
uname -a
Linux patatrack01 4.18.0-477.13.1.el8_8.x86_64 #1 SMP Thu May 18 10:27:05 EDT
2023 x86_64 x86_64 x86_64 GNU/Linux
boost::backtrace works
can provide example
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263
vincenzo Innocente changed:
What|Removed |Added
CC||ian at gcc dot gnu.org
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
using
gcc version 14.0.0 20231028 (experimental) [master r14-4988-g5d2a360f0a5] (GCC)
that contains the fix for #111936
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111936
--- Comment #9 from vincenzo Innocente ---
Thanks for the second patch.
I was indeed struggling with autoconf versions (1.15 vs 1.16)
Any chance to backport to gcc12 (our current production version)?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111936
--- Comment #7 from vincenzo Innocente ---
not explicitly in the src tree.
only run configure in the build directory.
what do I need to run in the src tree?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111936
--- Comment #5 from vincenzo Innocente ---
My bad: I have not used archive libraries for a long time and forgot about the
order rule.
The issue is indeed missing -fPIC.
Thanks for the fast action.
I applied the patch but it seems not sufficient.
If
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111934
--- Comment #3 from vincenzo Innocente ---
with
gcc version 14.0.0 20231024 (experimental) [master r14-4877-g724badcadf8] (GCC)
I get the same ICE.
Please note that one needs to include "iostream"
(in my test compile with "-DICE")
to trigger
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111936
--- Comment #1 from vincenzo Innocente ---
here is a minimal malloc hook that I would like to use
[innocent@patatrack01 ctest]$ cat getStacktrace.cc
#include
std::string get_stacktrace() {
std::string trace;
for (auto & entry :
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
I would like to use std::stacktrace in a shared library to be preloaded...
when I try to build the library even for this minimal example
cat
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111934
--- Comment #1 from vincenzo Innocente ---
sorry missed the version
gcc version 14.0.0 20231021 (experimental) [master r14-4817-g405a4140fc3] (GCC)
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
#ifdef ICE
#include
#endif
struct Me {
static Me & me() {
thread_local auto me =
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
in this simple code (on avx2)
int sum(float const * x) {
int ret = 0;
for (int i=0; i<8; ++i) ret +
: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
In the following (almost real) code gcc emits suboptimal code when std::optional
is used, compared with a home-made one and with clang
see https://godbolt.org/z/Pba51Ye7Y
-
code
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
in the following code foo does not vectorize, bar does.
clang vectorizes foo using a pattern that invokes vplzcntd
(code made a bit complex to make
: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
in the following code [1] foo does not vectorize, bar does
compiled with -march=haswell -Ofast --no-math
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108677
--- Comment #3 from vincenzo Innocente ---
sorry. the original internal bug report was for gcc 7.5
https://godbolt.org/z/9crafbqen
where I think the generated code is indeed wrong (and does not depend on the
presence of the constructor!)
SO,
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
in this real life code
#include
struct trig_pair {
double CosPhi;
double SinPhi;
trig_pair() : CosPhi(1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106012
--- Comment #6 from vincenzo Innocente ---
just to confirm that
-Ofast -fno-reciprocal-math -mno-recip
seems to inhibit all reciprocals...
https://godbolt.org/z/f4bccb9GP
: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
on x86_64
float f(float x) { return std::sqrt(x);}
compiles in
sqrtss xmm0, xmm0
even if --no-builtin is provided
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106012
vincenzo Innocente changed:
What|Removed |Added
Summary|rsqrtss instruction |rsqrtps and rcpps
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
with option -Ofast -mno-recip rsqrtss instruction is still generated.
https://godbolt.org/z/hGxrG7xPh
inhibiting rsqrtss and rcpss
-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
In this example GCC fails to emit branchless code while CLANG does.
In the actual application, measurements show a slowdown of up to a factor of 2.
I managed to force
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97707
--- Comment #3 from vincenzo Innocente ---
the main point in using -mprefer-vector-width=256 is to avoid clock throttling
in "mixed" workloads.
In small benchmarks like this one avx512 is faster (even on an old Silver) even
if trigger a slower
: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
this code will invoke _ZGVeN8v_sin instead of _ZGVdN4v_sin making use of zmm
registers
#include
int main
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92335
--- Comment #3 from vincenzo Innocente ---
Understood for float
it seems to me that the transformation does not occur for integers either
(signed or unsigned)
as in
using T= unsigned int;
T bar(T const * __restrict__ x,
T const * __restrict__
-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
in the following code (compiled with -O2 or -O3 and even with -march=haswell)
gcc will use a branchless construct in foo but not in bar (changing from float
to int
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88598
--- Comment #3 from vincenzo Innocente ---
what I am interested in is NOT a constant array, more a small-size
"sparse"-matrix that I can build explicitly at run time from other sources.
I have examples using Eigen if of any interest (
: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
g++ fails to optimize the code below
even with -Ofast https://godbolt.org/z/mYRgVX
independently of vectorization options https://godbolt.org/z/XMnCNz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86855
--- Comment #5 from vincenzo Innocente ---
I have indeed worked-around with
const __m128i neg = _mm_set_epi32(0,0,0x8000,0);
__m128i ret = __m128i(_mm_sub_ps(v5, v3));
return __m128(_mm_xor_si128(ret,neg));
const __m256i neg =
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86855
--- Comment #3 from vincenzo Innocente ---
looks more undefined behavior as
const __m128 neg = _mm_set_ps(0.0f,0.0f,-0.0f,-0.0f);
return _mm_xor_ps(_mm_sub_ps(v5, v3), neg);
with -O3 compiles in
xorps .LC0(%rip), %xmm0
ret
.LC0:
.long
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
this function
__m128 _mm_cross_ps(__m128 v1, __m128 v2) {
// same order is _MM_SHUFFLE(3,2,1,0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83857
--- Comment #2 from vincenzo Innocente ---
(In reply to Richard Biener from comment #1)
> I've seen a similar bug so maybe fixed already.
if the similar bug is #83753 it looks "fixed" in the version I tested
(at least
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
Created attachment 43133
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43133&action=edit
directory with all fi
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
in this example
#include
int * foo() {
int * p = new int[16];
memset(p,0,16*sizeof(int));
return p;
}
int * foo(int * q) {
int * p = new int[16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390
--- Comment #19 from vincenzo Innocente ---
Could you please also have a look at c++ and lto: this is what I get on my
skylake:
for c++ or lto -fno-split-paths pessimizes
[innocent@vinavx3 scimark2TMP]$ gcc -march=native -Wall -Ofast *.c -lm ;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390
--- Comment #17 from vincenzo Innocente ---
[innocent@vinavx3 innocent]$ mkdir scimark2TMP
[innocent@vinavx3 innocent]$ cd scimark2TMP
[innocent@vinavx3 scimark2TMP]$ wget
http://math.nist.gov/scimark2/scimark2_1c.zip .
.
gcc version 7.0.1
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
Created attachment 41125
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41125&action=edit
self contained scimark2 MC benchmark
just got hold of a AMD Ryze
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80248
--- Comment #2 from vincenzo Innocente ---
side note: the difference in timing between "aos2" and "soa" seems to be fully
accounted for by the integer multiplication "3*k[i]".
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80232
--- Comment #5 from vincenzo Innocente ---
I confirm that gather is almost twice as fast on
Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
w/r/t
Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
(used a benchmark version of PR80248 example)
so on skylake,
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
in the following example "aos" does not vectorize while the equivalent aos2
does vectorize using vgatherdps i
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796
--- Comment #10 from vincenzo Innocente ---
added a self contained "benchmark"
on my machine
[innocent@vinavx3 ctest]$ c++ -Ofast -Wall SparseOnly.c -march=native ; time
./a.out
0.496u 0.000s 0:00.49 100.0%0+0k 0+0io 0pf+0w
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796
--- Comment #9 from vincenzo Innocente ---
Created attachment 41070
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41070&action=edit
self contained benchmark of scimark2 SparseMat must
content is not randomized
param must be modified by hand in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796
--- Comment #8 from vincenzo Innocente ---
My understanding of the gather latency is that it essentially corresponds to a
load per cacheline: fast if all items are closeby, slower than scalar loads if
items are all in different cachelines. Not
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
on my machine
after the usual
mkdir scimark2TMP
cd scimark2TMP
wget http://math.nist.gov/scimark2
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
Created attachment 41053
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41053&action=edit
self contained benchm
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
given
cat aggressiveLoop.cc
#include
#include
float x[1024];
float y[1024];
float w[512];
float z[128];
float c,q
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77859
--- Comment #2 from vincenzo Innocente ---
Thanks for the fast response
I think I can "survive" with -O3 -fno-trapping-math
in principle it should not change the binary compatibility of the output w/r/t
-O2
and to the best of my understanding it
-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
It looks to me that to vectorize this code "relaxed floating point math" is not
a requirement
currently gcc version 7.0.0 20161004 (experiment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71666
--- Comment #2 from vincenzo Innocente ---
ok so it is just the sentence "See Optimize Options" which needs to be
changed...
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
as of today
-fprofile-generate does not seem to be documented in
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
it is quoted 4 times including a self-referencing
"
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
with gcc version 7.0.0 20160506 (experimental) [trunk revision 235977] (GCC)
cat main.cpp
int main() { return 0;}
c++ -O2 main.cpp
perf record -e
cpu/event=0xc4,umask=0x20,name
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69564
--- Comment #19 from vincenzo Innocente ---
patch applied to
gcc version 6.0.0 20160324 (experimental) [trunk revision 234461] (GCC)
I confirm the improvement in timing for c++ and lto
timing difference between gcc and c++ seems to be inside
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69564
--- Comment #5 from vincenzo Innocente ---
it is a regression
gcc version 4.9.3 (GCC)
c++ -Ofast *.c; ./a.out
** **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69564
--- Comment #3 from vincenzo Innocente ---
> Any reason you are using the c++ driver here?
Because I am interested in C++ performance
never imagined that the c++ front-end could make a difference on such a code...
From my point of view it is
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
mkdir scimark2; cd scimark2
wget http://math.nist.gov/scimark2/scimark2_1c.zip
unzip scimark2_1c.zip
c++ -Ofast *.c; ./a.out
c++ -Ofast *.c -flto; ./a.out
with gcc 4.9.3
gcc version
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
typedef float __attribute__( ( vector_size( 16 ) ) ) float32x4_t;
constexpr float32x4_t fill(float x) {
float32x4_t v{0
++
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
with -Ofast
the code generated differs
float rsqrt1(float a, float x, float y) {
return a/std::sqrt(x)/std::sqrt(y);
}
float rsqrt2(float a, float x, float y) {
return
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68125
--- Comment #2 from vincenzo Innocente ---
Thanks Marc for the fast check
I am still with
gcc version 6.0.0 20150801 (experimental) [trunk revision 226463] (GCC)
will update and verify
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68125
vincenzo Innocente changed:
What|Removed |Added
Status|UNCONFIRMED |RESOLVED
Resolution|---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67406
--- Comment #5 from vincenzo Innocente ---
does not work...
#pragma omp declare simd notinbranch
float __attribute__ ((__target__ ("default")))
fma(float x,float y, float z);
#pragma omp declare simd notinbranch
float __attribute__ ((__target__
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67406
--- Comment #4 from vincenzo Innocente ---
#pragma omp declare simd notinbranch
float __attribute__ ((__target__ ("default")))
fma(float x,float y, float z) {
return x+y*z;
}
#pragma omp declare simd notinbranch
float __attribute__
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
CC: jakub at gcc dot gnu.org
Target Milestone: ---
given
cat simdCloning.cc
#pragma omp declare simd notinbranch
float fma(float x
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67406
--- Comment #2 from vincenzo Innocente ---
is there any mechanism to tell gcc to generate the AVX2 clone using fma?
I understand it reduces portability; still, at the moment I have to support
mostly Intel platforms.
for AMD, gcc suggests to use
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
cat ompsimd_t.cc
#pragma omp declare simd notinbranch uniform(q)
float bar(float x, float * q, int){
return q[0]+q[1]*x;
}
c++ -fopenmp
: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
in 5.1 looks ok (according to http://gcc.godbolt.org)
cat condBug.cc
float v0
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
in the following example (compiled with -Ofast -std=c++11) the kahan summation
pattern is recognized in sum, not in counter
see
http://goo.gl
: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
given this code
#include <x86intrin.h>
typedef float __attribute__( ( vector_size( 16 ) ) ) float32x4_t;
inline
float32x4_t atan(float32x4_t t) {
constexpr float PIO4F
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63599
--- Comment #2 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
I agree that the code produces correct results. It looks to me sub-optimal.
I understand that with Ofast the sequence below will be always executed
andps%xmm5
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56829
--- Comment #2 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
just to add the OpenCL syntax and doc
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/any.html
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50374
vincenzo Innocente vincenzo.innocente at cern dot ch changed:
What|Removed |Added
Known to fail
Priority: P3
Component: web
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
At the very bottom of https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
one reads
It is possible to cast from one vector type to another, provided
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
I was expecting gcc to substitute min/max instruction for (a/b) ? a : b;
even for O2.
This is not always the case, only Ofast provides
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61747
--- Comment #2 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
I think you need -fno-signed-zeros for the transformation to be valid.
possible.
but then is it the O2 code that is wrong?
in any case adding -fno-signed-zeros makes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61747
--- Comment #4 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
confirm that
-ffinite-math-only -fno-signed-zeros
is equivalent to Ofast in this case
so we conclude that the code generated at O2 is wrong and
-ffinite-math-only -fno
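As a sketch of why -fno-signed-zeros and -ffinite-math-only are relevant to this kind of transformation, assume a ternary minimum of the form (a < b) ? a : b (the exact expression from the report is not reproduced here, and the names `min_ternary` and `quiet_nan` are hypothetical). Under strict IEEE semantics the result depends on operand order for signed zeros and NaN, so a compiler may not freely reorder operands or substitute a different min primitive.

```cpp
#include <cmath>
#include <limits>

// Ternary minimum: returns b whenever the comparison a < b is false.
inline float min_ternary(float a, float b) {
    return (a < b) ? a : b;
}

// Helper returning a quiet NaN for the demonstration.
inline float quiet_nan() {
    return std::numeric_limits<float>::quiet_NaN();
}
```

With these definitions, `min_ternary(0.0f, -0.0f)` returns -0.0f while `min_ternary(-0.0f, 0.0f)` returns +0.0f, and `min_ternary(1.0f, quiet_nan())` is NaN while `min_ternary(quiet_nan(), 1.0f)` is 1.0f. The two flags tell the compiler it may ignore exactly these cases.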
: enhancement
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
gcc is lacking a mechanism to convert (C-style cast) efficiently
extended-vectors among different types.
clang has recently introduce
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56829
vincenzo Innocente vincenzo.innocente at cern dot ch changed:
What|Removed |Added
Summary|Feature request: generic
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796
--- Comment #5 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
so with latest 4.9
gcc version 4.10.0 20140611 (experimental) [trunk revision 211467] (GCC)
situation has not changed much (the scalar version is now faster!):
I think
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61381
--- Comment #2 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
I am still at trunk revision 210507
will update and test again
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61381
vincenzo Innocente vincenzo.innocente at cern dot ch changed:
What|Removed |Added
Status|UNCONFIRMED
++
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
cat ceLambda.cc
struct Bar { constexpr Bar(float i):f(i){}; float f;};
float foo1(float x) {
constexpr Bar z{0};
auto f = [=](auto a, auto b) -> Bar { return z;};
return f(x,x).f
: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
in this example gcc generates 4 permutations for foo (while none is required)
On the positive side the code for bar (which is a more realistic use case)
seems optimal.
float
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61338
--- Comment #1 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
if I write it reverse
void foo2() {
for (int i=511; i>=0; --i)
x[1023-i] += y[1023-i]*z[512-i];
}
its ok
__Z4foo2v:
LFB1:
leaq 2048+_x(%rip), %rdx
xorl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49363
--- Comment #23 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
Which Syntax?
I want to reuse the same code for the various architectures and let gcc deal
with vectorization details.
The best I managed to do to share code is something