Re: Loop optimization

2010-05-22 Thread bearophile
Walter Bright:
 for (int j=0;j<1e6-1;j++)
 
 The j<1e6-1 is a floating point operation. It should be redone as an int one:
   j<1_000_000-1

The syntax 1e6 can represent an integer value of one million as perfectly and 
as precisely as 1_000_000, but traditionally in many languages the 
exponential syntax is used to represent floating point values only, I don't 
know why.
If the OP wants a short syntax to represent one million, this syntax can be 
used in D2:
foreach (j; 0 .. 10^^6)

Bye,
bearophile


Re: Loop optimization

2010-05-21 Thread Walter Bright

kai wrote:


Here is a boiled down test case:

void main (string[] args)
{
double [] foo = new double [cast(int)1e6];
for (int i=0;i<1e3;i++)
{
for (int j=0;j<1e6-1;j++)
{
foo[j]=foo[j]+foo[j+1];
}
}
}

Any ideas?


for (int j=0;j<1e6-1;j++)

The j<1e6-1 is a floating point operation. It should be redone as an int one:
 j<1_000_000-1


Re: Loop optimization

2010-05-19 Thread Joseph Wakeling
On 05/17/2010 01:15 AM, Walter Bright wrote:
 bearophile wrote:
 DMD compiler doesn't perform many optimizations,
 
 This is simply false. DMD does an excellent job with integer and pointer
 operations. It does a so-so job with floating point.

Interesting to note, relative to my earlier experience with D vs. C++ speed:
http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D.learn&artnum=19567

I'll have to try and put together a no-floating-point bit of code to
make a comparison.

Best wishes,

-- Joe


Re: Loop optimization

2010-05-18 Thread Walter Bright

bearophile wrote:

Walter Bright:


In my view, such switches are bad news, because:


The Intel compiler, Microsoft compiler, GCC and LLVM have a similar switch
(/fp:fast in the Microsoft compiler, -ffast-math on GCC, etc.). So you might
send your list of comments to the devs of each of those four compilers.


If I agreed with everything other vendors did with their compilers, I wouldn't 
have built my own <g>.





Re: Loop optimization

2010-05-17 Thread bearophile
Walter Bright:

 is not done because of roundoff error. Also,
 0 * x = 0
 is also not done because it is not a correct replacement if x is a NaN.

I have done a little experiment, compiling this D1 code with LDC:


import tango.stdc.stdio: printf;
void main(char[][] args) {
double x = cast(double)args.length;
double y = 0 * x;
printf("%f\n", y);
}


I think the asm generated by ldc shows what you say:


ldc -O3 -release -inline -output-s test
_Dmain:
pushl   %ebp
movl    %esp, %ebp
andl    $-16, %esp
subl    $32, %esp
movsd   .LCPI1_0, %xmm0
movd    8(%ebp), %xmm1
orps    %xmm0, %xmm1
subsd   %xmm0, %xmm1
pxor    %xmm0, %xmm0
mulsd   %xmm1, %xmm0
movsd   %xmm0, 4(%esp)
movl    $.str, (%esp)
call    printf
xorl    %eax, %eax
movl    %ebp, %esp
popl    %ebp
ret     $8



So I have added an extra unsafe floating point optimization:

ldc -O3 -release -inline -enable-unsafe-fp-math -output-s test
_Dmain:
subl    $12, %esp
movl    $0, 8(%esp)
movl    $0, 4(%esp)
movl    $.str, (%esp)
call    printf
xorl    %eax, %eax
addl    $12, %esp
ret     $8


GCC has similar switches.

Bye,
bearophile


Re: Loop optimization

2010-05-17 Thread Don

Walter Bright wrote:

Don wrote:

bearophile wrote:

kai:

Any ideas? Am I somehow not hitting a vital compiler optimization?


DMD compiler doesn't perform many optimizations, especially on 
floating point computations.


More precisely:
In terms of optimizations performed, DMD isn't too far behind gcc. But 
it performs almost no optimization on floating point. Also, the 
inliner doesn't yet support the newer D features (this won't be hard 
to fix) and the scheduler is based on Pentium1.


Have to be careful when talking about floating point optimizations. For 
example,


   x/c = x * 1/c

is not done because of roundoff error. Also,

   0 * x = 0

is also not done because it is not a correct replacement if x is a NaN.


The most glaring limitation of the FP optimiser is that it seems to 
never keep values in the FP stack. So that it will often do:

FSTP x
FLD x
instead of FST x
Fixing this would probably give a speedup of ~20% on almost all FP code, 
and would unlock the path to further optimisation.


Re: Loop optimization

2010-05-17 Thread Steven Schveighoffer
On Fri, 14 May 2010 12:40:52 -0400, bearophile bearophileh...@lycos.com  
wrote:



Steven Schveighoffer:

In C/C++, the default value for doubles is 0.


I think in C and C++ the default value for doubles is uninitialized  
(that is anything).


You are probably right.  All I did to figure this out is print out the  
first element of the array in my C++ version of kai's code.  So it may be  
arbitrarily set to 0.


-Steve


Re: Loop optimization

2010-05-17 Thread BCS

Hello Don,


The most glaring limitation of the FP optimiser is that it seems to
never keep values in the FP stack. So that it will often do:
FSTP x
FLD x
instead of FST x
Fixing this would probably give a speedup of ~20% on almost all FP
code, and would unlock the path to further optimisation.


Does DMD have the groundwork for doing FP peephole optimizations? That sounds 
like an easy one.


--
... IXOYE





Re: Loop optimization

2010-05-17 Thread bearophile
Walter Bright:

In my view, such switches are bad news, because:

The Intel compiler, Microsoft compiler, GCC and LLVM have a similar switch 
(/fp:fast in the Microsoft compiler, -ffast-math on GCC, etc.). So you might send 
your list of comments to the devs of each of those four compilers.

I have used the unsafe fp switch in LDC to make my small raytracers run faster, 
with good results. So I use it now and then where max precision is not 
important and small errors are not going to ruin the output.

I have asked the LLVM head developer to improve this optimization in LLVM, 
because in my opinion it's not aggressive enough to put LLVM on par with GCC. 
So LDC too will probably get better on this in the future. This unsafe 
optimization is off by default, so if you don't like it you can avoid it. Its 
presence in LDC has caused zero problems for me so far (because when I 
need safer/more precise results I don't use it).


4. most of those optimizations can be done by hand if you want to, meaning 
that then their behavior will be reliable, portable and correct for your 
application

This is true for any optimization.

Bye,
bearophile


Re: Loop optimization

2010-05-16 Thread Don

strtr wrote:

== Quote from Don (nos...@nospam.com)'s article

strtr wrote:

== Quote from bearophile (bearophileh...@lycos.com)'s article

But the bigger problem in your code is that you are performing operations on

NaNs (that's the default initialization of FP values in D), and operations on 
NaNs
are usually quite slower.

I didn't know that. Is it the same for inf?

Yes, nan and inf are usually the same speed. However, it's very CPU
dependent, and even *within* a CPU! On Pentium 4, for example, for x87,
nan is 200 times slower than a normal value (!), but on Pentium 4 SSE
there's no speed difference at all between nan and normal. I think
there's no speed difference on AMD, but I'm not sure.
There's almost no documentation on it at all.


Thanks!
NaNs being slower I can understand but inf might well be a value you want to 
use.


Yes. What's happened is that none of the popular programming languages 
support special IEEE values, so they're given very low priority by chip 
designers. In the Pentium 4 case, they're implemented entirely in 
microcode. A 200X slowdown is really significant.


However, the bit pattern for NaN is 0x..., which is the same as a 
negative integer, so an uninitialized floating-point variable has a 
quite high probability of being a NaN. I'm certain there are a lot of C 
programs out there which are inadvertently using NaNs.


Re: Loop optimization

2010-05-16 Thread div0

Jérôme M. Berger wrote:
 div0 wrote:
 Jérôme M. Berger wrote:
 That depends. In C/C++, the default value for any global variable
 is to have all bits set to 0 whatever that means for the actual data
 type. 
 No it's not, it's always uninitialized.

   According to the C89 standard and onwards it *must* be initialized
 to 0. If it isn't then your implementation isn't standard compliant
 (needless to say, gcc, Visual, llvm, icc and dmc are all standard
 compliant, so you won't have any difficulty checking).

Ah, I only do C++, where the standard is to not initialise.
I didn't know the two specs had diverged like that.

- --
My enormous talent is exceeded only by my outrageous laziness.
http://www.ssTk.co.uk


Re: Loop optimization

2010-05-16 Thread Jouko Koski

div0 d...@users.sourceforge.net wrote:

Jérôme M. Berger wrote:

That depends. In C/C++, the default value for any global variable
is to have all bits set to 0 whatever that means for the actual data
type.

Ah, I only do C++, where the standard is to not initialise.


No, in C++ all *global or static* variables are zero-initialized. By 
default, stack variables are default-initialized, which for doubles means 
they are not initialized at all and can hold any value.


The C-function calloc is required to fill the newly allocated memory with 
zero bit pattern; malloc is not required to initialize anything. Fresh heap 
areas given by malloc may have zero bit pattern, but one should really make 
no assumptions on this.


--
Jouko 



Re: Loop optimization

2010-05-16 Thread Jérôme M. Berger
div0 wrote:
 Jérôme M. Berger wrote:
 div0 wrote:
 Jérôme M. Berger wrote:
That depends. In C/C++, the default value for any global variable
 is to have all bits set to 0 whatever that means for the actual data
 type. 
 No it's not, it's always uninitialized.

  According to the C89 standard and onwards it *must* be initialized
 to 0. If it isn't then your implementation isn't standard compliant
 (needless to say, gcc, Visual, llvm, icc and dmc are all standard
 compliant, so you won't have any difficulty checking).
 
 Ah, I only do C++, where the standard is to not initialise.
 I didn't know the two specs had diverged like that.
 
The specs haven't diverged and C++ has mostly the same behaviour as
C where global variables are concerned. The only difference is that
if the global variable is a class with a constructor, then that
constructor gets called after the memory is zeroed out.

Jerome
-- 
mailto:jeber...@free.fr
http://jeberger.free.fr
Jabber: jeber...@jabber.fr





Re: Loop optimization

2010-05-16 Thread Walter Bright

Don wrote:

bearophile wrote:

kai:

Any ideas? Am I somehow not hitting a vital compiler optimization?


DMD compiler doesn't perform many optimizations, especially on 
floating point computations.


More precisely:
In terms of optimizations performed, DMD isn't too far behind gcc. But 
it performs almost no optimization on floating point. Also, the inliner 
doesn't yet support the newer D features (this won't be hard to fix) and 
the scheduler is based on Pentium1.


Have to be careful when talking about floating point optimizations. For example,

   x/c = x * 1/c

is not done because of roundoff error. Also,

   0 * x = 0

is also not done because it is not a correct replacement if x is a NaN.


Re: Loop optimization

2010-05-16 Thread Walter Bright

bearophile wrote:

DMD compiler doesn't perform many optimizations,


This is simply false. DMD does an excellent job with integer and pointer 
operations. It does a so-so job with floating point.


There are probably over a thousand optimizations at all levels that dmd does 
with integer and pointer code.


Compare the generated code with and without -O. Even without -O, dmd does a long 
list of optimizations (such as common subexpression elimination).


Re: Loop optimization

2010-05-16 Thread bearophile
Walter Bright:
 This is simply false. DMD does an excellent job with integer and pointer 
 operations. It does a so-so job with floating point.
 There are probably over a thousand optimizations at all levels that dmd does 
 with integer and pointer code.

You are of course right, I understand your feelings, that was silly of me -.-
I must be more precise in my posts. You are right that dmd surely performs 
numerous optimizations. What I meant was a comparison with other 
compilers, particularly ldc. And even then, generic words about a generic 
comparison aren't useful. So I am sorry.

Bye,
bearophile


Re: Loop optimization

2010-05-16 Thread Brad Roberts
On 5/16/2010 4:15 PM, Walter Bright wrote:
 bearophile wrote:
 DMD compiler doesn't perform many optimizations,
 
 This is simply false. DMD does an excellent job with integer and pointer
 operations. It does a so-so job with floating point.
 
 There are probably over a thousand optimizations at all levels that dmd
 does with integer and pointer code.
 
 Compare the generated code with and without -O. Even without -O, dmd
 does a long list of optimizations (such as common subexpression
 elimination).

While it's false that DMD doesn't do many optimizations, it's true that it's
behind more modern compiler optimizers.

I've been working to fix some of the grossly bad holes in dmd's inliner, which is
one area that's just obviously lacking (see bug 2008).  But gcc and ldc (and
likely msvc, though I lack any direct knowledge) are simply a decade or so ahead.
 It's not a criticism of dmd or a suggestion that the priorities are in the
wrong place, just a point of fact.  They've got larger teams of people and are
spending significant time on just improving and adding optimizations.

Later,
Brad


Re: Loop optimization

2010-05-15 Thread Don

bearophile wrote:

kai:

Any ideas? Am I somehow not hitting a vital compiler optimization?


DMD compiler doesn't perform many optimizations, especially on floating point 
computations.


More precisely:
In terms of optimizations performed, DMD isn't too far behind gcc. But 
it performs almost no optimization on floating point. Also, the inliner 
doesn't yet support the newer D features (this won't be hard to fix) and 
the scheduler is based on Pentium1.


Re: Loop optimization

2010-05-15 Thread div0

Jérôme M. Berger wrote:

   That depends. In C/C++, the default value for any global variable
 is to have all bits set to 0 whatever that means for the actual data
 type. 

No it's not, it's always uninitialized.

Visual Studio will initialise memory & a function's stack segment with
0xcd, but only in debug builds. In release mode you get what was already
there. That used to be the case with gcc (which used 0xdeadbeef) as well,
unless they've changed it.

- --
My enormous talent is exceeded only by my outrageous laziness.
http://www.ssTk.co.uk


Re: Loop optimization

2010-05-15 Thread Jérôme M. Berger
div0 wrote:
 Jérôme M. Berger wrote:
  That depends. In C/C++, the default value for any global variable
 is to have all bits set to 0 whatever that means for the actual data
 type. 
 
 No it's not, it's always uninitialized.
 
According to the C89 standard and onwards it *must* be initialized
to 0. If it isn't then your implementation isn't standard compliant
(needless to say, gcc, Visual, llvm, icc and dmc are all standard
compliant, so you won't have any difficulty checking).

 Visual studio will initialise memory & a function's stack segment with
 0xcd, but only in debug builds. In release mode you get what was already
 there. That used to be the case with gcc (which used 0xdeadbeef) as well
 unless they've changed it.
 
This does not concern global variables. Therefore the second part
of my message applies, the part you didn't quote:
 The default value for local variables and malloc/new memory is
 whatever was in this place in memory before which can be anything.
 The default value for calloc is to have all bits to 0 as for global
 variables.

I should have added that some compiler / standard libraries allow
you to have a default initialization value for debugging purpose.

Jerome
-- 
mailto:jeber...@free.fr
http://jeberger.free.fr
Jabber: jeber...@jabber.fr





Re: Loop optimization

2010-05-15 Thread Don

strtr wrote:

== Quote from bearophile (bearophileh...@lycos.com)'s article

But the bigger problem in your code is that you are performing operations on

NaNs (that's the default initialization of FP values in D), and operations on 
NaNs
are usually quite slower.

I didn't know that. Is it the same for inf?


Yes, nan and inf are usually the same speed. However, it's very CPU 
dependent, and even *within* a CPU! On Pentium 4, for example, for x87, 
nan is 200 times slower than a normal value (!), but on Pentium 4 SSE 
there's no speed difference at all between nan and normal. I think 
there's no speed difference on AMD, but I'm not sure.

There's almost no documentation on it at all.



I used it as a null for structs.



Re: Loop optimization

2010-05-15 Thread strtr
== Quote from Don (nos...@nospam.com)'s article
 strtr wrote:
  == Quote from bearophile (bearophileh...@lycos.com)'s article
  But the bigger problem in your code is that you are performing operations 
  on
  NaNs (that's the default initialization of FP values in D), and operations 
  on NaNs
  are usually quite slower.
 
  I didn't know that. Is it the same for inf?
 Yes, nan and inf are usually the same speed. However, it's very CPU
 dependent, and even *within* a CPU! On Pentium 4, for example, for x87,
 nan is 200 times slower than a normal value (!), but on Pentium 4 SSE
 there's no speed difference at all between nan and normal. I think
 there's no speed difference on AMD, but I'm not sure.
 There's almost no documentation on it at all.

Thanks!
NaNs being slower I can understand but inf might well be a value you want to 
use.

  I used it as a null for structs.
 



Re: Loop optimization

2010-05-15 Thread Ali Çehreli

Steven Schveighoffer wrote:

 double [] foo = new double [cast(int)1e6];
 foo[] = 0;

I've discovered that this is the equivalent of the last line above:

  foo = 0;

I don't see it in the spec. Is that an old or an unintended feature?

Ali


Re: Loop optimization

2010-05-15 Thread Simen kjaeraas

Ali Çehreli acehr...@yahoo.com wrote:


Steven Schveighoffer wrote:

  double [] foo = new double [cast(int)1e6];
  foo[] = 0;

I've discovered that this is the equivalent of the last line above:

   foo = 0;

I don't see it in the spec. Is that an old or an unintended feature?


Looks unintended to me.  In fact (though that might be the
C programmer in me doing the thinking), it looks to me like
"foo = null;". It might be related to the discussion in
digitalmars.D "Is [] mandatory for array operations?".

--
Simen


Re: Loop optimization

2010-05-15 Thread Ali Çehreli

Simen kjaeraas wrote:
 Ali Çehreli acehr...@yahoo.com wrote:

 Steven Schveighoffer wrote:

   double [] foo = new double [cast(int)1e6];
   foo[] = 0;

 I've discovered that this is the equivalent of the last line above:

foo = 0;

 I don't see it in the spec. Is that an old or an unintended feature?

I have to make a correction: It works with fixed-sized arrays. It does 
not work with the dynamic array initialization above.


 Looks unintended to me.  In fact (though that might be the
 C programmer in me doing the thinking), it looks to me like
 "foo = null;". It might be related to the discussion in
 digitalmars.D "Is [] mandatory for array operations?".

Thanks,
Ali


Re: Loop optimization

2010-05-15 Thread bearophile
Ali Çehreli:
 I don't see it in the spec. Is that an old or an unintended feature?

It's a compiler bug; don't use that bracket-less syntax in your programs.
Don is fighting to fix such problems (and I have written several posts and bug 
reports on that stuff).

Bye,
bearophile


Re: Loop optimization

2010-05-14 Thread Lars T. Kyllingstad
On Fri, 14 May 2010 02:38:40 +, kai wrote:

 Hello,
 
 I was evaluating using D for some numerical stuff. However I was
 surprised to find that looping & array indexing was not very speedy
 compared to alternatives (gcc et al). I was using the DMD2 compiler on
 mac and windows, with -O -release. Here is a boiled down test case:
 
   void main (string[] args)
   {
   double [] foo = new double [cast(int)1e6];
   for (int i=0;i<1e3;i++)
   {
   for (int j=0;j<1e6-1;j++)
   {
   foo[j]=foo[j]+foo[j+1];
   }
   }
   }
 
 Any ideas? Am I somehow not hitting a vital compiler optimization?
 Thanks for your help.

Two suggestions:


1. Have you tried the -noboundscheck compiler switch?  Unlike C, D checks
that you do not try to read/write beyond the end of an array, but you can
turn those checks off with said switch.


2. Can you use vector operations?  If the example you gave is
representative of your specific problem, then you can't because you are
adding overlapping parts of the array.  But if you are doing operations on
separate arrays, then array operations will be *much* faster.

http://www.digitalmars.com/d/2.0/arrays.html#array-operations

As an example, compare the run time of the following code with the example
you gave:

void main ()
{
double[] foo = new double [cast(int)1e6];
double[] slice1 = foo[0 .. 999_998];
double[] slice2 = foo[1 .. 999_999];

for (int i=0;i<1e3;i++)
{
// BAD, BAD, BAD.  DON'T DO THIS even though
// it's pretty awesome:
slice1[] += slice2[];
}
}

Note that this is very bad code, since slice1 and slice2 are overlapping
arrays, and there is no guarantee as to which order the array elements are
computed -- it may even occur in parallel.  It was just an example of the
speed gains you may expect from designing your code with array operations
in mind.

-Lars


Re: Loop optimization

2010-05-14 Thread Lars T. Kyllingstad
On Fri, 14 May 2010 06:31:29 +, Lars T. Kyllingstad wrote:
 void main ()
 {
 double[] foo = new double [cast(int)1e6]; double[] slice1 =
 foo[0 .. 999_998];
 double[] slice2 = foo[1 .. 999_999];
 
 for (int i=0;i<1e3;i++)
 {
 // BAD, BAD, BAD.  DON'T DO THIS even though // it's pretty
 awesome:
 slice1[] += slice2[];
 }
 }

Hmm.. something very strange is going on with the line breaking here.

-Lars


Re: Loop optimization

2010-05-14 Thread bearophile
kai:

 I was evaluating using D for some numerical stuff.

For that evaluation you probably have to use the LDC compiler, that is able to 
optimize better.


   void main (string[] args)
   {
   double [] foo = new double [cast(int)1e6];
   for (int i=0;i<1e3;i++)
   {
   for (int j=0;j<1e6-1;j++)
   {
   foo[j]=foo[j]+foo[j+1];
   }
   }
   }

Using floating point for indexes and lengths is not a good practice. In D large 
numbers are written like 1_000_000.
Use -release too.

 
 Any ideas? Am I somehow not hitting a vital compiler optimization?

DMD compiler doesn't perform many optimizations, especially on floating point 
computations.
But the bigger problem in your code is that you are performing operations on 
NaNs (that's the default initialization of FP values in D), and operations on 
NaNs are usually quite slower.


Your code in C:

#include <stdio.h>
#include <stdlib.h>
#define N 100

int main() {
double *foo = calloc(N, sizeof(double)); // malloc suffices here
int i, j;
for (j = 0; j < N; j++)
foo[j] = 1.0;

for (i = 0; i < 1000; i++)
for (j = 0; j < N-1; j++)
foo[j] = foo[j] + foo[j + 1];

printf("%f", foo[N-1]);
return 0;
}

/*
gcc -O3 -s -Wall test.c -o test
Timings, outer loop=1_000 times: 7.72 s

--

gcc -Wall -O3 -fomit-frame-pointer -msse3 -march=native test.c -o test
(Running on a VirtualBox)
Timings, outer loop=1_000 times: 7.69 s
Just the inner loop:
.L7:
fldl    8(%edx)
fadd    %st, %st(1)
fxch    %st(1)
fstpl   (%edx)
addl    $8, %edx
cmpl    %ecx, %edx
jne     .L7
*/



Your code in D1:

version (Tango)
import tango.stdc.stdio: printf;
else
import std.c.stdio: printf;

void main() {
const int N = 1_000_000;
double[] foo = new double[N];
foo[] = 1.0;

for (int i = 0; i < 1_000; i++)
for (int j = 0; j < N-1; j++)
foo[j] = foo[j] + foo[j + 1];

printf("%f", foo[N-1]);
}


/*
dmd -O -release -inline test.d
(Not running on a VirtualBox)
Timings, outer loop=1_000 times: 9.35 s
Just the inner loop:
L34:    fld     qword ptr 8[EDX*8][ECX]
fadd    qword ptr [EDX*8][ECX]
fstp    qword ptr [EDX*8][ECX]
inc     EDX
cmp     EDX,0F423Fh
jb      L34

---

ldc -O3 -release -inline test.d
(Running on a VirtualBox)
Timings, outer loop=1_000 times: 7.87 s
Just the inner loop:
.LBB1_2:
movsd   (%eax,%ecx,8), %xmm0
addsd   8(%eax,%ecx,8), %xmm0
movsd   %xmm0, (%eax,%ecx,8)
incl    %ecx
cmpl    $99, %ecx
jne .LBB1_2

---

ldc -unroll-allow-partial -O3 -release -inline test.d
(Running on a VirtualBox)
Timings, outer loop=1_000 times: 7.75 s
Just the inner loop:
.LBB1_2:
movsd   (%eax,%ecx,8), %xmm0
addsd   8(%eax,%ecx,8), %xmm0
movsd   %xmm0, (%eax,%ecx,8)
movsd   8(%eax,%ecx,8), %xmm0
addsd   16(%eax,%ecx,8), %xmm0
movsd   %xmm0, 8(%eax,%ecx,8)
movsd   16(%eax,%ecx,8), %xmm0
addsd   24(%eax,%ecx,8), %xmm0
movsd   %xmm0, 16(%eax,%ecx,8)
movsd   24(%eax,%ecx,8), %xmm0
addsd   32(%eax,%ecx,8), %xmm0
movsd   %xmm0, 24(%eax,%ecx,8)
movsd   32(%eax,%ecx,8), %xmm0
addsd   40(%eax,%ecx,8), %xmm0
movsd   %xmm0, 32(%eax,%ecx,8)
movsd   40(%eax,%ecx,8), %xmm0
addsd   48(%eax,%ecx,8), %xmm0
movsd   %xmm0, 40(%eax,%ecx,8)
movsd   48(%eax,%ecx,8), %xmm0
addsd   56(%eax,%ecx,8), %xmm0
movsd   %xmm0, 48(%eax,%ecx,8)
movsd   56(%eax,%ecx,8), %xmm0
addsd   64(%eax,%ecx,8), %xmm0
movsd   %xmm0, 56(%eax,%ecx,8)
movsd   64(%eax,%ecx,8), %xmm0
addsd   72(%eax,%ecx,8), %xmm0
movsd   %xmm0, 64(%eax,%ecx,8)
addl    $9, %ecx
cmpl    $99, %ecx
jne .LBB1_2
*/

As you see the code generated by ldc is about as good as the one generated by 
gcc. There are of course other ways to optimize this code...

Bye,
bearophile


Re: Loop optimization

2010-05-14 Thread Lars T. Kyllingstad
On Fri, 14 May 2010 07:32:54 -0400, Steven Schveighoffer wrote:

 On Fri, 14 May 2010 02:31:29 -0400, Lars T. Kyllingstad
 pub...@kyllingen.nospamnet wrote:
 
 On Fri, 14 May 2010 02:38:40 +, kai wrote:
 
 
 I was using the DMD2 compiler on
 mac and windows, with -O -release.

 1. Have you tried the -noboundscheck compiler switch?  Unlike C, D
 checks that you do not try to read/write beyond the end of an array,
 but you can turn those checks off with said switch.
 
 -release implies -noboundscheck (in fact, I did not know there was a
 noboundscheck flag, I thought you had to use -release).
 
 -Steve


You are right, just checked it now.  But it's strange, I thought the 
whole point of the -noboundscheck switch was that it would be independent 
of -release.  But perhaps I remember wrongly (or perhaps Walter just 
hasn't gotten around to it yet).

Anyway, sorry for the misinformation.

-Lars


Re: Loop optimization

2010-05-14 Thread Steven Schveighoffer

On Thu, 13 May 2010 22:38:40 -0400, kai k...@nospam.zzz wrote:


Hello,

I was evaluating using D for some numerical stuff. However I was  
surprised to

find that looping & array indexing was not very speedy compared to
alternatives (gcc et al). I was using the DMD2 compiler on mac and  
windows,

with -O -release. Here is a boiled down test case:

void main (string[] args)
{
double [] foo = new double [cast(int)1e6];
for (int i=0;i<1e3;i++)
{
for (int j=0;j<1e6-1;j++)
{
foo[j]=foo[j]+foo[j+1];
}
}
}

Any ideas? Am I somehow not hitting a vital compiler optimization?  
Thanks for

your help.


I figured it out.

in D, the default value for doubles is nan, so you are adding countless  
scores of nan's which is costly for some reason (not a big floating point  
guy, so I'm not sure about this).


In C/C++, the default value for doubles is 0.

BTW, without any initialization of the array, what are you expecting the  
code to do?  In the C++ version, I suspect you are simply adding a bunch  
of 0s together.


Equivalent D code which first initializes the array to 0s:

void main (string[] args)
{
double [] foo = new double [cast(int)1e6];
foo[] = 0; // probably want to change this to something more meaningful
for (int i=0;i<cast(int)1e3;i++)
{
for (int j=0;j<cast(int)1e6-1;j++)
{
foo[j]+=foo[j+1];
}
}
}

On my PC, it runs almost exactly at the same speed as the C++ version.

-Steve


Re: Loop optimization

2010-05-14 Thread kai
Thanks for the help all!

 2. Can you use vector operations?  If the example you gave is
 representative of your specific problem, then you can't because you are
 adding overlapping parts of the array.  But if you are doing operations
 on separate arrays, then array operations will be *much* faster.

Unfortunately, I don't think I will be able to. The actual code is
computing norms of a sequence of points and then updating their values as
needed (MLE smoothing/prediction).

 For that evaluation you probably have to use the LDC compiler, that is
 able to optimize better.

I was scared off by the warning that D 2.0 support is experimental. I
realize D 2 itself is still non-production, but for academic interests
industrial-strength isn't all that important if it usually works :).

 Using floating point for indexes and lengths is not a good practice.
 In D large numbers are written like 1_000_000. Use -release too.

Good to know, thanks (that's actually a great feature for scientists!).

 DMD compiler doesn't perform many optimizations, especially on floating
 point computations. But the bigger problem in your code is that you are
 performing operations on NaNs (that's the default initalization of FP
 values in D), and operations on NaNs are usually quite slower.

 in D, the default value for doubles is nan, so you are adding countless
 scores of nan's which is costly for some reason (not a big floating point
 guy, so I'm not sure about this).

Ah ha, that was it-- serves me right for trying to boil down a test case and
failing miserably. I'll head back to my code now and try to find the real
problem :-) At some point I removed the initialization data obviously.


Re: Loop optimization

2010-05-14 Thread strtr
== Quote from bearophile (bearophileh...@lycos.com)'s article
 But the bigger problem in your code is that you are performing operations on
NaNs (that's the default initialization of FP values in D), and operations on 
NaNs
are usually quite slower.

I didn't know that. Is it the same for inf?
I used it as a null for structs.



Re: Loop optimization

2010-05-14 Thread bearophile
kai:

 I was scared off by the warning that D 2.0 support is experimental.

LDC is D1 still, mostly :-(
And at the moment it uses LLVM 2.6.
LLVM 2.7 contains a new optimization that can improve that code some more.


 Good to know, thanks (that's actually a great feature for scientists!).

In theory D is a reasonable fit for numerical computations too, but there is a 
lot of work to do still. And some parts of D's design will need to be improved 
to help numerical code performance.

From my extensive tests, if you use it correctly, D1 code compiled with LDC 
can be about as efficient as C code compiled with GCC or sometimes a little 
more efficient.

-

Steven Schveighoffer:
 In C/C++, the default value for doubles is 0.

I think in C and C++ the default value for doubles is uninitialized (that is 
anything).

Bye,
bearophile


Re: Loop optimization

2010-05-14 Thread Jérôme M. Berger
bearophile wrote:
 kai:
 
 I was scared off by the warning that D 2.0 support is experimental.
 
 LDC is D1 still, mostly :-(
 And at the moment it uses LLVM 2.6.
 LLVM 2.7 contains a new optimization that can improve that code some more.
 
 
  Good to know, thanks (that's actually a great feature for scientists!).
 
 In theory D is a reasonable fit for numerical computations too, but there is a 
 lot of work to do still. And some parts of D's design will need to be improved 
 to help numerical code performance.
 
 From my extensive tests, if you use it correctly, D1 code compiled with LDC 
 can be about as efficient as C code compiled with GCC or sometimes a little 
 more efficient.
 
 -
 
 Steven Schveighoffer:
 In C/C++, the default value for doubles is 0.
 
 I think in C and C++ the default value for doubles is uninitialized (that 
 is anything).
 
That depends. In C/C++, the default value for any global variable
is to have all bits set to 0 whatever that means for the actual data
type. The default value for local variables and malloc/new memory is
whatever was in this place in memory before which can be anything.
The default value for calloc is to have all bits to 0 as for global
variables.

In the OP code, the malloc will probably return memory that has
never been used before, therefore probably initialized to 0 too (OS
dependent).

Jerome
-- 
mailto:jeber...@free.fr
http://jeberger.free.fr
Jabber: jeber...@jabber.fr


