Re: Performance of tables slower than built in?
On Saturday, 25 May 2019 at 20:07:19 UTC, NaN wrote: It's differential evolution, you can pretty much optimise anything you can quantify. http://www1.icsi.berkeley.edu/~storn/code.html Thanks, they even have Python example code. Nice.
Re: Performance of tables slower than built in?
On Saturday, 25 May 2019 at 14:58:25 UTC, Ola Fosheim Grøstad wrote: On Saturday, 25 May 2019 at 12:51:20 UTC, NaN wrote: I used an evolutionary optimisation algorithm on the table all at once. So you do a weighted sum of max deviation, and 1st and 2nd order discontinuity at the joins. And minimise that across the table as a whole. It seemed you could massively overweight the discontinuities without really affecting the curve fitting that much. So perfect joins only cost a little extra deviation in the fit of the polynomial. Wow, it is pretty cool that you managed to minimize 2nd order discontinuity like that! Is there a paper describing the optimisation algorithm you used? It's differential evolution, you can pretty much optimise anything you can quantify. http://www1.icsi.berkeley.edu/~storn/code.html
Re: Performance of tables slower than built in?
On Saturday, 25 May 2019 at 12:51:20 UTC, NaN wrote: I used an evolutionary optimisation algorithm on the table all at once. So you do a weighted sum of max deviation, and 1st and 2nd order discontinuity at the joins. And minimise that across the table as a whole. It seemed you could massively overweight the discontinuities without really affecting the curve fitting that much. So perfect joins only cost a little extra deviation in the fit of the polynomial. Wow, it is pretty cool that you managed to minimize 2nd order discontinuity like that! Is there a paper describing the optimisation algorithm you used?
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? [...] I'll just leave this here: https://www.dannyarends.nl/Using%20CTFE%20in%20D%20to%20speed%20up%20Sine%20and%20Cosine?
Re: Performance of tables slower than built in?
On Saturday, 25 May 2019 at 09:52:22 UTC, Ola Fosheim Grøstad wrote: On Saturday, 25 May 2019 at 09:04:31 UTC, NaN wrote: It's a pretty common technique in audio synthesis. Indeed. CSound does this. What I've done in the past is store a table of polynomial segments that were optimised with curve fitting. That's an interesting solution, how do you avoid higher order discontinuities between segments? Crossfading? Or maybe it wasn't audible. I used an evolutionary optimisation algorithm on the table all at once. So you do a weighted sum of max deviation, and 1st and 2nd order discontinuity at the joins. And minimise that across the table as a whole. It seemed you could massively overweight the discontinuities without really affecting the curve fitting that much. So perfect joins only cost a little extra deviation in the fit of the polynomial.
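The weighted-sum objective described above could be sketched roughly like this (a hypothetical D illustration, not NaN's actual code: cubic segments, a sampled max-deviation term, and penalties on 1st/2nd-derivative mismatch at the joins, which a black-box optimiser such as differential evolution would then minimise):

```d
import std.math : abs, sin, PI;

// One cubic per table entry: y(t) = c0 + c1*t + c2*t^2 + c3*t^3, t in [0,1]
double evalSeg(const double[4] c, double t)
{
    return c[0] + t * (c[1] + t * (c[2] + t * c[3]));
}

// Weighted cost over the whole table: max fit deviation plus a penalty on
// 1st/2nd-order discontinuities at the joins (wrapping around the cycle).
double cost(const double[4][] segs, double wJoin)
{
    double maxDev = 0.0, joinPenalty = 0.0;
    foreach (s, seg; segs)
    {
        // sampled max deviation from the target curve (here one sine cycle)
        foreach (k; 0 .. 64)
        {
            const double t = k / 63.0;
            const double x = (s + t) / segs.length;
            const double e = abs(evalSeg(seg, t) - sin(2.0 * PI * x));
            if (e > maxDev) maxDev = e;
        }
        // derivative mismatch between the end of this segment (t = 1)
        // and the start of the next segment (t = 0)
        const double[4] next = segs[(s + 1) % segs.length];
        const double d1 = abs((seg[1] + 2*seg[2] + 3*seg[3]) - next[1]);
        const double d2 = abs((2*seg[2] + 6*seg[3]) - 2*next[2]);
        joinPenalty += d1 + d2;
    }
    return maxDev + wJoin * joinPenalty;
}
```

Overweighting the joins then just means picking a large wJoin relative to the deviation term.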
Re: Performance of tables slower than built in?
On Saturday, 25 May 2019 at 09:04:31 UTC, NaN wrote: It's a bit of extra work to calculate the waveform but it actually works out faster than having huge LUTs since you're typically only producing maybe 100 samples in each interrupt callback. Another hybrid option when filling a buffer might be to fill two buffers with an approximation that is crossfaded between the end-points. A bit tricky, but say you have a Taylor polynomial (or even a recurrence relation) that is "correct" near the beginning of the buffer, and another one that is correct near the end of the buffer. Then you could fill two buffers from each end and cross fade between the buffers. You might get some phase-cancellation errors and phasing-like distortion though.
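For concreteness, the blend step of that hybrid idea might look something like this (a hedged sketch; the linear blend weight and the buffer layout are my assumptions, not something from the post):

```d
// Linearly crossfade bufA (an approximation accurate near the start of the
// block) into bufB (one accurate near the end). Equal-length slices assumed.
void crossfade(const double[] bufA, const double[] bufB, double[] outBuf)
{
    const size_t n = outBuf.length;
    foreach (i; 0 .. n)
    {
        // weight runs from 0 at the first sample to 1 at the last
        const double w = n > 1 ? cast(double) i / (n - 1) : 0.0;
        outBuf[i] = (1.0 - w) * bufA[i] + w * bufB[i];
    }
}
```

An equal-power (sin/cos) fade could be substituted if the linear blend audibly dips in the middle.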
Re: Performance of tables slower than built in?
On Saturday, 25 May 2019 at 09:04:31 UTC, NaN wrote: It's a pretty common technique in audio synthesis. Indeed. CSound does this. What I've done in the past is store a table of polynomial segments that were optimised with curve fitting. That's an interesting solution, how do you avoid higher order discontinuities between segments? Crossfading? Or maybe it wasn't audible.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 17:40:40 UTC, Ola Fosheim Grøstad wrote: On Friday, 24 May 2019 at 17:04:33 UTC, Alex wrote: I'm not sure what the real precision of the built-in functions is but it shouldn't be hard to max out a double using standard methods (even if slow, but irrelevant after the LUT has been created). LUTs are primarily useful when you use sin(x) as a signal or when a crude approximation is good enough. One advantage of a LUT is that you can store a more complex computation than the basic function. Like a filtered square wave. It's a pretty common technique in audio synthesis. What I've done in the past is store a table of polynomial segments that were optimised with curve fitting. It's a bit of extra work to calculate the waveform but it actually works out faster than having huge LUTs since you're typically only producing maybe 100 samples in each interrupt callback, so it gets pretty likely that your LUT is pushed into slower cache memory in between calls to generate the audio.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 17:04:33 UTC, Alex wrote: On Friday, 24 May 2019 at 13:57:30 UTC, Ola Fosheim Grøstad wrote: On Friday, 24 May 2019 at 12:24:02 UTC, Alex wrote: If it truly is 27x faster then that is very relevant and knowing why is important. Of course, a lot of that might simply be due to LDC and I wasn't able to determine this. Just one more thing you really ought to consider: It isn't obvious that a LUT using double will be more precise than computing sin using single precision float. So when I use single precision float with ldc and "-O3 -ffast-math" then I get roughly 300ms. So in that case the LUT is only 3 times faster. Perhaps not worth it then. You might as well just use float instead of double. The downside is that -ffast-math makes all your computations more inaccurate. It is also possible that recent processors can do multiple sin/cos as SIMD. Too many options… I know. The thing is, the LUT can have as much precision as one wants. One could even spend several days calculating it then loading it from disk. With linear interpolation of what is effectively a discrete time signal, you get an extra 12dB of signal to noise ratio each time you 2x oversample the input. So basically if you start with a LUT of 4 samples, that should give you about 8dB at baseline, and each time you double the length you get an extra 12dB. Or in simpler terms: double the table length, get 2 bits less error. So a table size of 8192 should give you approximately 24 bits clear of errors. But be aware that the error is relative to the maximum magnitude of the signal in the table.
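That rule of thumb (roughly 2 extra bits per doubling) is easy to check empirically. A small sketch, assuming a full-cycle sine LUT with linear interpolation (the table layout and function names here are illustrative, not the code from the thread):

```d
import std.math : sin, abs, PI;
import std.stdio : writefln;

// Max absolute error of a linearly interpolated sine LUT with n steps
// per cycle (n + 1 stored samples so interpolation never wraps).
double maxLutError(size_t n)
{
    auto tab = new double[](n + 1);
    foreach (i; 0 .. n + 1)
        tab[i] = sin(2.0 * PI * i / n);
    double maxErr = 0.0;
    foreach (k; 0 .. 100_000)
    {
        const double x = k / 100_000.0;          // phase in cycles
        const double pos = x * n;
        const size_t i = cast(size_t) pos;
        const double frac = pos - i;
        const double approx = (1.0 - frac) * tab[i] + frac * tab[i + 1];
        const double e = abs(approx - sin(2.0 * PI * x));
        if (e > maxErr) maxErr = e;
    }
    return maxErr;
}

void main()
{
    // the error should drop by roughly 4x (12 dB / 2 bits) per doubling
    foreach (bits; 2 .. 14)
        writefln("n = %5d  maxErr = %e", 1UL << bits, maxLutError(1UL << bits));
}
```

The measured maxima track the analytic bound (2π/n)²/8, which for n = 8192 lands around 7e-8, i.e. roughly 24 bits, as claimed above.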
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 17:04:33 UTC, Alex wrote: I'm not sure what the real precision of the built-in functions is but it shouldn't be hard to max out a double using standard methods (even if slow, but irrelevant after the LUT has been created). LUTs are primarily useful when you use sin(x) as a signal or when a crude approximation is good enough. One advantage of a LUT is that you can store a more complex computation than the basic function. Like a filtered square wave.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 13:57:30 UTC, Ola Fosheim Grøstad wrote: On Friday, 24 May 2019 at 12:24:02 UTC, Alex wrote: If it truly is 27x faster then that is very relevant and knowing why is important. Of course, a lot of that might simply be due to LDC and I wasn't able to determine this. Just one more thing you really ought to consider: It isn't obvious that a LUT using double will be more precise than computing sin using single precision float. So when I use single precision float with ldc and "-O3 -ffast-math" then I get roughly 300ms. So in that case the LUT is only 3 times faster. Perhaps not worth it then. You might as well just use float instead of double. The downside is that -ffast-math makes all your computations more inaccurate. It is also possible that recent processors can do multiple sin/cos as SIMD. Too many options… I know. The thing is, the LUT can have as much precision as one wants. One could even spend several days calculating it then loading it from disk. I'm not sure what the real precision of the built-in functions is but it shouldn't be hard to max out a double using standard methods (even if slow, but irrelevant after the LUT has been created). Right now I'm just using the built-ins... Maybe later I'll get back around to all this and make some progress.
Re: Performance of tables slower than built in?
With linear interpolation you get roughly the same result with floats, a little more efficient too (half the memory and a bit faster):

__gshared float[512+1] QuarterSinTab;

void init(){
    const auto res = QuarterSinTab.length-1;
    for(int i = 0; i < res; i++)
        QuarterSinTab[i] = sin(PI*(0.5*i/cast(double)res));
    QuarterSinTab[$-1] = sin(PI*0.5);
}

auto fastQuarterLookup(float x){
    const uint mantissa = cast(uint)((x - floor(x)) * cast(float)(1U<<24));
    const float sign = cast(float)(1 - cast(int)((mantissa>>22)&2));
    const uint phase = (mantissa^((1U<<23)-((mantissa>>22)&1)))&((1U<<23)-1);
    const uint quarterphase = (phase>>13)&511;
    const float frac = cast(float)(phase&((1U<<13)-1))*cast(float)(1.0f/(1U<<13));
    return sign*((1.0f-frac)*QuarterSinTab[quarterphase] + frac*QuarterSinTab[quarterphase+1]);
}
Re: Performance of tables slower than built in?
This appears to get roughly the same results as sin(x):

__gshared double[512+1] QuarterSinTab;

void init(){
    const auto res = QuarterSinTab.length-1;
    for(int i = 0; i < res; i++)
        QuarterSinTab[i] = sin(PI*(0.5*i/cast(double)res));
    QuarterSinTab[$-1] = sin(PI*0.5);
}

auto fastQuarterLookup(double x){
    const ulong mantissa = cast(ulong)((x - floor(x)) * cast(double)(1UL<<54));
    const double sign = cast(double)(1 - cast(int)((mantissa>>52)&2));
    const ulong phase = (mantissa^((1UL<<53)-((mantissa>>52)&1)))&((1UL<<53)-1);
    const uint quarterphase = (phase>>43)&511;
    const double frac = cast(double)(phase&((1UL<<43)-1))*cast(double)(1.0/(1UL<<43));
    return sign*((1.0-frac)*QuarterSinTab[quarterphase] + frac*QuarterSinTab[quarterphase+1]);
}

Ola.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 11:45:46 UTC, Ola Fosheim Grøstad wrote:

const double sign = cast(double)(-cast(uint)((mantissa>>63)&1));

Yep, this was wrong (0 or -1). Should be something like (1 or -1):

const double sign = cast(double)(1 - cast(int)((mantissa>>62)&2));

You'll have to code it up more carefully than I did, just following the same principles. (These ±1 errors do not affect the performance.) Also, for comparison, just running the 2 lookups in the loop comes to 32ms. So 3 times that sounds reasonable for extracting the phase, determining the sign, reversing the phase and doing the linear interpolation.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 12:24:02 UTC, Alex wrote: If it truly is 27x faster then that is very relevant and knowing why is important. Of course, a lot of that might simply be due to LDC and I wasn't able to determine this. Just one more thing you really ought to consider: It isn't obvious that a LUT using double will be more precise than computing sin using single precision float. So when I use single precision float with ldc and "-O3 -ffast-math" then I get roughly 300ms. So in that case the LUT is only 3 times faster. Perhaps not worth it then. You might as well just use float instead of double. The downside is that -ffast-math makes all your computations more inaccurate. It is also possible that recent processors can do multiple sin/cos as SIMD. Too many options… I know.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 12:56:55 UTC, Ola Fosheim Grøstad wrote: [...] suggest above. Just beware that I haven't checked the code, so it might be off by ±1 and such. So before using such code for anything important, make sure that it is tested for the edge cases, like denormal numbers (values close to zero), and roundoff errors where computations "accidentally" cross over 1.0 and stuff like that.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 12:24:02 UTC, Alex wrote: So it seems like a quarter-wave LUT is 27 times faster than sin… If so then that is great and what I'd expected to achieve originally. I guess this is using LDC though? I wasn't able to compile with LDC since after updating I'm getting linker errors that I have to go figure out. Yes, the gist linked above is just your code with minor changes, that was 4-5 times faster. To get to 27 times faster you need to use the integer bit-manipulation scheme that I suggest above. Just beware that I haven't checked the code, so it might be off by ±1 and such. Anyway, it is more fun for you to code up your own version than to try to figure out mine. Just follow the principles and you should get close to that performance, I think. (I'll refine the code later, but don't really have time now.) You just have to make sure that the generated instructions fill the entire CPU pipeline. What exactly does this mean? I realize the pipeline in CPUs is how the CPU decodes and optimizes the instructions, but when you say "You have to make sure" that presupposes there is a method or algorithm to know. Yes, you have to look up information about the CPU in your computer. Each core has a set of "lanes" that are computed simultaneously. Some instructions can go into many lanes, but not all. Then there might be bubbles in the pipeline (the lane) that can be filled up with integer/bit manipulation instructions. It is tedious to look that stuff up. So, last resort: just try to mix simple integer with simple double computations (avoid division). Are you saying that I did not have enough instructions that the pipeline could take advantage of? Yes, you most likely got bubbles. Empty space where the core has nothing to send down a lane, because it is waiting for some computation to finish so that it can figure out what to do next.
Basic optimization: Step 1: reduce dependencies between computations. Step 2: make sure you generate a mix of simple integer/double instructions that can fill up all the computation lanes at the same time. Step 3: make sure loops only contain a few instructions; the CPU can unroll loops in hardware if they are short (not valid here though). Of course, a lot of that might simply be due to LDC and I wasn't able to determine this. I think I got better performance because I filled more lanes in the pipeline. Half sin was done above but quarter sine can be used (there are 4 quadrants but only one has to be tabulated because all the others differ by sign and reversion (1 - x), it's a matter of figuring out the sign). Yes, as I mentioned, the first bit of the phase is the sign and the second bit of the phase is the reversion of the indexing. Of course it requires extra computation so it would be interesting to see the difference in performance for the extra logic. It adds perhaps 2-5 cycles or so, is my guess. exp(x) can be written as exp(floor(x) + {x}) = exp(floor(x))*exp({x}) [...] With linear interpolation one can get very accurate (for all practical purposes) LUT methods that, if your code is right, are at least an order of magnitude faster. The memory requirements will be quite small with linear interpolation. I think you need to do something with the x before you look up, so that you have some kind of fast nonlinear mapping to the indexes. But for exp() you might prefer an approximation instead, perhaps a polynomial Taylor series. Searching the web should give some ideas. It seems you already have the half-sin done. I did the quarter sin though, not the half-sin (but that is almost the same, just drop the reversion of the indexing). (Let's talk about this later, since we both have other things on our plate. Fun topic! :-)
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 11:45:46 UTC, Ola Fosheim Grøstad wrote: On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad wrote: On Thursday, 23 May 2019 at 21:47:45 UTC, Alex wrote: Either way, sin is still twice as fast. Also, in the code the sinTab version is missing the writeln so it would have been faster... so it is not being optimized out. Well, when I run this modified version: https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c on https://run.dlang.io/ then I get: LUT: 709 sin(x): 2761 So the LUT is 3-4 times faster even with your quarter-LUT overhead. FWIW, as far as I can tell I managed to get the lookup version down to 104 by using bit manipulation tricks like these:

auto fastQuarterLookup(double x){
    const ulong mantissa = cast(ulong)((x - floor(x)) * (cast(double)(1UL<<63)*2.0));
    const double sign = cast(double)(-cast(uint)((mantissa>>63)&1));
    … etc

So it seems like a quarter-wave LUT is 27 times faster than sin… If so then that is great and what I'd expected to achieve originally. I guess this is using LDC though? I wasn't able to compile with LDC since after updating I'm getting linker errors that I have to go figure out. You just have to make sure that the generated instructions fill the entire CPU pipeline. What exactly does this mean? I realize the pipeline in CPUs is how the CPU decodes and optimizes the instructions, but when you say "You have to make sure" that presupposes there is a method or algorithm to know. Are you saying that I did not have enough instructions that the pipeline could take advantage of? In any case, I'll check your code out and try to figure out the details and see exactly what is going on. If it truly is 27x faster then that is very relevant and knowing why is important. Of course, a lot of that might simply be due to LDC and I wasn't able to determine this. Can you do some stats for dmd and ldc? You seem to be interested in this, are you up for a challenge?
The idea is to use tables to optimize these functions. Half sin was done above but quarter sine can be used (there are 4 quadrants but only one has to be tabulated because all the others differ by sign and reversion (1 - x), it's a matter of figuring out the sign). Of course it requires extra computation so it would be interesting to see the difference in performance for the extra logic. Then there is exp. exp(x) can be written as exp(floor(x) + {x}) = exp(floor(x))*exp({x}) and so one can optimize this by tabulating exp(x) for 0 <= x < 1, which handles the fractional part of x, then tabulating it for a wide range of integers (probably in powers of 2). E.g., exp(3.5) = exp(3)*exp(.5), both of which come from a lookup table. Or one could do exp(3) = exp(1 + 1 + 1) = exp(1)*exp(1)*exp(1) (this requires iteration if we do not tabulate exp(3)). Hence one would limit the iteration count by tabulating things like exp(10^k) and exp(k) for -10 < k < 10. The idea is basically that one can get really dense LUTs for a small range that are then used to build the function for arbitrary inputs. With linear interpolation one can get very accurate (for all practical purposes) LUT methods that, if your code is right, are at least an order of magnitude faster. The memory requirements will be quite small with linear interpolation (and ideally quadratic or cubic if the costs are not too high). That was what I was starting to work on before I got thrown off by it being much slower. It seems you already have the half-sin done. One could do things like sin, cos (obviously easy), exp, exp()^2, erf(x), sinh, cosh, etc. Things like sec could also be done as it would save a division since it seems they take about 30 cycles. But it would depend on the memory used. [I can't mess with this now because I've gotta work on other things at the moment] Thanks.
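A minimal sketch of the exp(x) = exp(floor(x))*exp({x}) scheme described above, assuming linear interpolation for the fractional table and a fixed integer range of -10 to 10 (table sizes and function names are my own choices, not from the post):

```d
import std.math : exp, floor;

enum FRAC_SIZE = 1024;
__gshared double[21] intExpTab;               // exp(k) for k = -10 .. 10
__gshared double[FRAC_SIZE + 1] fracExpTab;   // exp(f) for f in [0, 1]

void initExpTabs()
{
    foreach (k; -10 .. 11)
        intExpTab[k + 10] = exp(cast(double) k);
    foreach (i; 0 .. FRAC_SIZE + 1)
        fracExpTab[i] = exp(i / cast(double) FRAC_SIZE);
}

// exp(x) as table(integer part) * interpolated table(fractional part);
// valid for -10.0 <= x < 10.0
double lutExp(double x)
{
    const double fl = floor(x);
    const double t = (x - fl) * FRAC_SIZE;
    const size_t i = cast(size_t) t;
    const double frac = t - i;
    const double efrac = (1.0 - frac) * fracExpTab[i] + frac * fracExpTab[i + 1];
    return intExpTab[cast(int) fl + 10] * efrac;
}
```

With 1024 fractional steps the interpolation error of the exp(f) factor is around 3e-7, so the relative error of the product stays in the same ballpark across the whole range.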
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 12:01:55 UTC, Alex wrote: Well, the QuarterWave was supposed to generate just a quarter since that is all that is required for these functions due to symmetry and periodicity. I started with a half to get that working, then figure out the sign flipping. Sure, it is a tradeoff. You pollute the cache less this way, but you have to figure out the sign and the lookup-direction. The trick is then to turn the phase into an unsigned integer, then you get: 1. the highest bit will tell you that you need to use the inverse sign for the result. 2. the next highest bit will tell you that you need to look up in the reverse direction. What is key to performance here is that x86 can do many simple integer/bit operations in parallel, but only a few floating point operations. Also avoid all conditionals. Use bitmasking instead, something along the lines of:

const ulong phase = mantissa^((1UL<<63)-((mantissa>>62)&1));
const uint quarterphase = (phase>>53)&511;

(Haven't checked the correctness of this, but this shows the general principle.) Ola.
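To make the two rules above concrete, here is a toy 16-bit version (my own illustration, not code from the thread: bit 15 as the sign, bit 14 as the direction, a 14-bit in-quarter index):

```d
// Branchless decode of a 16-bit phase accumulator:
// bit 15 -> sign of the result, bit 14 -> reversed table direction.
int signOf(ushort phase)
{
    return 1 - cast(int)((phase >> 14) & 2);   // +1 first half, -1 second half
}

ushort quarterIndex(ushort phase)
{
    // If bit 14 is set, XOR with 0x3FFF reverses the in-quarter index;
    // otherwise the mask is 0 and the index passes through untouched.
    const int mask = 0x3FFF * ((phase >> 14) & 1);
    return cast(ushort)((phase ^ mask) & 0x3FFF);
}
```

Both helpers compile to a handful of ALU instructions with no branches, which is exactly what lets them overlap with the floating-point interpolation.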
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 11:57:44 UTC, Alex wrote: On Friday, 24 May 2019 at 08:13:00 UTC, Basilez B. wrote: On Thursday, 23 May 2019 at 10:16:42 UTC, Alex wrote: On Wednesday, 22 May 2019 at 08:25:58 UTC, Basile B. wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? [...] Hi, lookup tables ARE faster but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON AT RUNTIME. Just replace it with a static if (Method = "Linear") { /*...*/} else { /*...*/} Also takes care to the type used. With DMD the implicit coercion of float and double can lead to extra conversions. You'll directly see a 15% gain after refactoring the switch. Surely not?!?! Surely the compiler can optimize that switch since the value passed is CT? I thought the whole point of not having static switch(analogous to static if) was because it would go ahead and optimize these cases for us... and it's just a switch, just a jmp table. Try by yourself but to be clear note that I don't like your attitude, which I find disrespectful. Are you an idiot or on medications? My attitude? What did you want me to do? Suck your cock? My attitude? Seriously? WHERE? WHERE? You quoted everything? I don't like your attitude? You seem to be extremely oversensitive and take things out of context? It sounds like you just don't like people questioning you in any way shape or form even if you are wrong. "Surely not?!?! Surely the compiler can optimize that switch since the value passed is CT? I thought the whole point of not having static switch(analogous to static if) was because it would go ahead and optimize these cases for us... and it's just a switch, just a jmp table." Where? I seriously do not like your attitude though! You attack me when I said nothing offensive to you. 
You have serious problems. Get back on or get off the lithium. [Now, of course, you get to see what an attitude really looks like... maybe I've taught you a bit about perspective? I doubt it.] What amazes me is that you Basilez come in and start stuff and yet my response to you starting stuff will get deleted and yours won't... that is typical. You really have mental issues. No where in my post was I hostile to you and yet you decided to have a hissy fit. You are a snowflake that obviously need a ass kicking. That will teach you not to be so sensitive and interpret things in your own little cushioned wall universe. See, no one was attacking you or questioning your intelligence... that is something you perceived and made up all on your own. But you will be supported by all the other people who love cushy walls and koolaid rather than getting the help you need.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 11:45:46 UTC, Ola Fosheim Grøstad wrote: On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad wrote: On Thursday, 23 May 2019 at 21:47:45 UTC, Alex wrote: Either way, sin is still twice as fast. Also, in the code the sinTab version is missing the writeln so it would have been faster... so it is not being optimized out. Well, when I run this modified version: https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c on https://run.dlang.io/ then I get: LUT: 709 sin(x): 2761 So the LUT is 3-4 times faster even with your quarter-LUT overhead. FWIW, as far as I can tell I managed to get the lookup version down to 104 by using bit manipulation tricks like these:

auto fastQuarterLookup(double x){
    const ulong mantissa = cast(ulong)((x - floor(x)) * (cast(double)(1UL<<63)*2.0));
    const double sign = cast(double)(-cast(uint)((mantissa>>63)&1));
    … etc

So it seems like a quarter-wave LUT is 27 times faster than sin… You just have to make sure that the generated instructions fill the entire CPU pipeline. Well, the QuarterWave was supposed to generate just a quarter since that is all that is required for these functions due to symmetry and periodicity. I started with a half to get that working, then figure out the sign flipping. Essentially one just has to tabulate a quarter of sin, that is, from 0 to 90 degrees, and then get the sign right. This allows one to have 4 times the resolution or 1/4 the size at the same cost. Or, to put it another way, sin has 4-fold redundancy. I'll check out your code, thanks for looking into it.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 08:13:00 UTC, Basilez B. wrote: On Thursday, 23 May 2019 at 10:16:42 UTC, Alex wrote: On Wednesday, 22 May 2019 at 08:25:58 UTC, Basile B. wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? [...] Hi, lookup tables ARE faster but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON AT RUNTIME. Just replace it with a static if (Method = "Linear") { /*...*/} else { /*...*/} Also takes care to the type used. With DMD the implicit coercion of float and double can lead to extra conversions. You'll directly see a 15% gain after refactoring the switch. Surely not?!?! Surely the compiler can optimize that switch since the value passed is CT? I thought the whole point of not having static switch(analogous to static if) was because it would go ahead and optimize these cases for us... and it's just a switch, just a jmp table. Try by yourself but to be clear note that I don't like your attitude, which I find disrespectful. Are you an idiot or on medications? My attitude? What did you want me to do? Suck your cock? My attitude? Seriously? WHERE? WHERE? You quoted everything? I don't like your attitude? You seem to be extremely oversensitive and take things out of context? It sounds like you just don't like people questioning you in any way shape or form even if you are wrong. "Surely not?!?! Surely the compiler can optimize that switch since the value passed is CT? I thought the whole point of not having static switch(analogous to static if) was because it would go ahead and optimize these cases for us... and it's just a switch, just a jmp table." Where? I seriously do not like your attitude though! You attack me when I said nothing offensive to you. You have serious problems. 
Get back on or get off the lithium. [Now, of course, you get to see what an attitude really looks like... maybe I've taught you a bit about perspective? I doubt it.]
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad wrote: On Thursday, 23 May 2019 at 21:47:45 UTC, Alex wrote: Either way, sin is still twice as fast. Also, in the code the sinTab version is missing the writeln so it would have been faster... so it is not being optimized out. Well, when I run this modified version: https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c on https://run.dlang.io/ then I get: LUT: 709 sin(x): 2761 So the LUT is 3-4 times faster even with your quarter-LUT overhead. FWIW, as far as I can tell I managed to get the lookup version down to 104 by using bit manipulation tricks like these:

auto fastQuarterLookup(double x){
    const ulong mantissa = cast(ulong)((x - floor(x)) * (cast(double)(1UL<<63)*2.0));
    const double sign = cast(double)(-cast(uint)((mantissa>>63)&1));
    … etc

So it seems like a quarter-wave LUT is 27 times faster than sin… You just have to make sure that the generated instructions fill the entire CPU pipeline.
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:

for(int i = 0; i < res; i++)
    QuarterSinTab[i] = sin(PI*(i/cast(double)res));

Btw, I think this creates a half sine, not a quarter, so you want (?):

QuarterSinTab[i] = sin(PI*(0.5*i/cast(double)res));

QuarterSinTab[$-1] = QuarterSinTab[0];

This creates a discontinuity if you create a quarter sine, in that case you probably wanted:

QuarterSinTab[$-1] = sin(PI*0.5)

Otherwise you will never get 1 or -1. But none of these affect performance.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad wrote: https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c I made an error here:

return s*((1 - f)*QuarterSinTab[ai&511] + f*QuarterSinTab[(ai+1)&511]);

Should of course be:

return s*((1 - f)*QuarterSinTab[ai&511] + f*QuarterSinTab[(ai&511)+1]);

However that does not impact the performance.
Re: Performance of tables slower than built in?
On Thursday, 23 May 2019 at 21:47:45 UTC, Alex wrote: Either way, sin is still twice as fast. Also, in the code the sinTab version is missing the writeln so it would have been faster... so it is not being optimized out. Well, when I run this modified version: https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c on https://run.dlang.io/ then I get: LUT: 709 sin(x): 2761 So the LUT is 3-4 times faster even with your quarter-LUT overhead.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad wrote: So the LUT is 3-4 times faster even with your quarter-LUT overhead. 4-5 times faster actually, since I made the LUT size known at compile time.
Re: Performance of tables slower than built in?
On Thursday, 23 May 2019 at 10:16:42 UTC, Alex wrote: On Wednesday, 22 May 2019 at 08:25:58 UTC, Basile B. wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? [...] Hi, lookup tables ARE faster but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON. Just replace it with a static if (Method == "Linear") { /*...*/ } else { /*...*/ } Also take care with the type used. With DMD the implicit coercion of float and double can lead to extra conversions. You'll directly see a 15% gain after refactoring the switch. Surely not?!?! Surely the compiler can optimize that switch since the value passed is CT? I thought the whole point of not having static switch (analogous to static if) was because it would go ahead and optimize these cases for us... and it's just a switch, just a jmp table. Try it yourself, but to be clear, note that I don't like your attitude, which I find disrespectful. I'm just here to help. I explained the big problem you have in your code and you start discussing something that's not to be discussed AT ALL. Look at this https://d.godbolt.org/z/vtzVdp, in sinTab you'll be able to see call pure nothrow @nogc @safe int object.__switch!(immutable(char), "Linear", "Constant").__switch(scope const(immutable(char)[]))@PLT and this even with LDC2 -O3. That's why your LUT is so slow. Refactor the switch with "static if", the branch will be eliminated and you'll see the perf improvement.
Re: Performance of tables slower than built in?
On Thursday, 23 May 2019 at 21:50:38 UTC, Alex wrote: I've used very small LUTs, like a length of 5, and it didn't significantly change anything. Use a size that is 2^n, then mask the index, and hopefully that will turn off bounds checks. E.g. if the LUT size is 16, then index the LUT with "i & 15". I haven't tested this well but was just thrown off by the results, as it should easily have been inverted; I expected quite a significant speed-up (several factors) and not the reverse. Well, you could take the time, multiply by the clock frequency, and divide by the number of iterations to get the number of cycles per iteration. If it is more than a dozen, then something is wrong.
Re: Performance of tables slower than built in?
On Thursday, 23 May 2019 at 21:30:47 UTC, Alex wrote: I don't see how that can necessarily be faster. A LUT can give full 64-bit precision with one operation. CORDIC needs iteration, at least 10 to be of any use. LUTs are precision independent, assuming the creation cost is not included. It isn't; IEEE sin is typically not fast. ARM CPUs let you run fewer iterations, though, for nonstandard floats. I have some code that uses sin, exp and a few other primitive algebraic functions, and in one case the code is extremely slow (it uses exp). I haven't done much testing of all this but something just seems off somewhere. Don't know about exp, but some operations are slow when you get too close to zero: so-called denormal numbers. I guess there is no tool that can tell one exactly what is happening to a piece of code in the CPU... basically an instruction-level profiler? VTune from Intel; not free, AFAIK.
Re: Performance of tables slower than built in?
On Thursday, 23 May 2019 at 18:57:03 UTC, Ola Fosheim Grøstad wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? Not when I tried it with one of the online compilers; the LUT is 3-4 times faster on your tight inner loop, but I guess it depends on the CPU. A LUT should be very fast in long-running tight inner loops like that, once the cache is warm, with such a small LUT that fits in working memory (cache). I've used very small LUTs, like a length of 5, and it didn't significantly change anything. Unless it is being hit hard at the start and that is skewing the results, it doesn't seem to be the issue. I haven't tested this well but was just thrown off by the results, as it should easily have been inverted; I expected quite a significant speed-up (several factors) and not the reverse.
Re: Performance of tables slower than built in?
On Thursday, 23 May 2019 at 19:17:40 UTC, Ola Fosheim Grøstad wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:

xxx = 0;
sw.reset();
sw.start();
for(double i = 0; i < 1000; i++) xxx += sin(PI*i);
t = sw.peek().msecs;
writeln(t);
sw.stop();
}

What you are doing wrong is that xxx is never used... So DMD removes it altogether? If you add writeln(xxx) after the second loop as well, then maybe the measurements make more sense? Ola. Maybe, I thought I put the writeln(xxx) in there to do that... yeah, looking at my code it's there, so that is not the problem. This is the code I had:

import std.datetime;

double xxx = 0;
StopWatch sw;
sw.start();
for(double i = 0; i < 1000; i++) xxx += sinTab(i);
auto t = sw.peek().msecs;
writeln(t);
sw.stop();
writeln(xxx);

xxx = 0;
sw.reset();
sw.start();
for(double i = 0; i < 1000; i++) xxx += sin(PI*i);
t = sw.peek().msecs;
writeln(t);
sw.stop();
writeln(xxx);

Either I modified it later, deleted one of those lines, or the editor screwed up when I pasted it... I don't recall touching the code afterwards. Either way, sin is still twice as fast. Also, in the code the sinTab version is missing the writeln, so it would have been faster... so it is not being optimized out.
Re: Performance of tables slower than built in?
On Thursday, 23 May 2019 at 15:20:22 UTC, Timon Gehr wrote: On 23.05.19 12:21, Alex wrote: On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access. It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays. Surely not? I'm not sure what method is used to calculate them, and maybe a table method is used internally for the common functions (maybe the periodic ones), but memory access surely is faster than multiplying doubles? ... Depends on what kind of memory access, and what kind of faster. If you hit L1 cache then a memory access might be (barely) faster than a single double multiplication. (But modern hardware usually can do multiple double multiplies in parallel, and presumably also multiple memory reads, using SIMD and/or instruction-level parallelism.) I think a single in-register double multiplication will be roughly 25 times faster than an access to main memory. Each access to main memory will pull an entire cache line from main memory to the cache, so if you have good locality (you usually won't with a LUT), your memory accesses will be faster on average. There are a lot of other microarchitectural details that can matter quite a lot for performance. I realize a lot of optimizations could be made, but I still find it hard to believe that it could be done faster even with special hardware, unless a LUT is used in the hardware already or some highly optimized algorithm.
The whole point of general-purpose computing was to get away from specialized hardware, because it was cheaper and one could spend the resources on improving raw cycle performance, which would benefit the entire system. But I guess because of the complexity of hardware nowadays it's hard to really figure out what is going on. http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/ Maybe when it is an "intrinsic" it can avoid some of the issues that might otherwise cost in a LUT (such as a context switch). And most of the time these functions are computed by some series that requires many terms. I'd expect, say, that to compute sin one would require at least 10 multiplies for any accuracy... and surely that is much slower than simply accessing a table (it's true that my code is more complex due to the modulos, and maybe that is eating up the diff). Do you have any proof of your claims? Like a paper that discusses such things, so I can see what's really going on and how they achieve such performance (and how accurate it is)? Not exactly what you asked, but this might help: https://www.agner.org/optimize Also, look up the CORDIC algorithm. I don't see how that can necessarily be faster. A LUT can give full 64-bit precision with one operation. CORDIC needs iteration, at least 10 to be of any use. LUTs are precision independent, assuming the creation cost is not included. I couldn't find anything in that link that specifically addressed speeding up typical math functions. It would be nice to know exactly what is going on: which functions are optimized, and the typical speed-up over a LUT. I have some code that uses sin, exp and a few other primitive algebraic functions, and in one case the code is extremely slow (it uses exp). I haven't done much testing of all this but something just seems off somewhere.
Either sin is heavily optimized and exp is not (exp too admits a CORDIC-like algorithm, since exp(a+b) = exp(a)*exp(b)), or something else is happening that I'm not aware of... which is why I went to tabularizing these functions in the first place. Also, I don't know the kind of accuracy they have, which doesn't help much, but I could use a more accurate algorithm that is slower, since it is pre-computed. That site I linked does talk about some of the issues and such... I guess I just need to update my thinking on modern CPUs a little. I guess there is no tool that can tell one exactly what is happening to a piece of code in the CPU... basically an instruction-level profiler?
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:

xxx = 0;
sw.reset();
sw.start();
for(double i = 0; i < 1000; i++) xxx += sin(PI*i);
t = sw.peek().msecs;
writeln(t);
sw.stop();
}

What you are doing wrong is that xxx is never used... So DMD removes it altogether? If you add writeln(xxx) after the second loop as well, then maybe the measurements make more sense? Ola.
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? Not when I tried it with one of the online compilers; the LUT is 3-4 times faster on your tight inner loop, but I guess it depends on the CPU. A LUT should be very fast in long-running tight inner loops like that, once the cache is warm, with such a small LUT that fits in working memory (cache).
Re: Performance of tables slower than built in?
On 23.05.19 12:21, Alex wrote: On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access. It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays. Surely not? I'm not sure what method is used to calculate them, and maybe a table method is used internally for the common functions (maybe the periodic ones), but memory access surely is faster than multiplying doubles? ... Depends on what kind of memory access, and what kind of faster. If you hit L1 cache then a memory access might be (barely) faster than a single double multiplication. (But modern hardware usually can do multiple double multiplies in parallel, and presumably also multiple memory reads, using SIMD and/or instruction-level parallelism.) I think a single in-register double multiplication will be roughly 25 times faster than an access to main memory. Each access to main memory will pull an entire cache line from main memory to the cache, so if you have good locality (you usually won't with a LUT), your memory accesses will be faster on average. There are a lot of other microarchitectural details that can matter quite a lot for performance. And most of the time these functions are computed by some series that requires many terms. I'd expect, say, that to compute sin one would require at least 10 multiplies for any accuracy... and surely that is much slower than simply accessing a table (it's true that my code is more complex due to the modulos, and maybe that is eating up the diff). Do you have any proof of your claims?
Like a paper that discusses such things so I can see what's really going on and how they achieve such performance(and how accurate)? Not exactly what you asked, but this might help: https://www.agner.org/optimize Also, look up the CORDIC algorithm.
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access. It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays. Surely not? I'm not sure what method is used to calculate them, and maybe a table method is used internally for the common functions (maybe the periodic ones), but memory access surely is faster than multiplying doubles? And most of the time these functions are computed by some series that requires many terms. I'd expect, say, that to compute sin one would require at least 10 multiplies for any accuracy... and surely that is much slower than simply accessing a table (it's true that my code is more complex due to the modulos, and maybe that is eating up the diff). Do you have any proof of your claims? Like a paper that discusses such things, so I can see what's really going on and how they achieve such performance (and how accurate it is)?
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 08:25:58 UTC, Basile B. wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? [...] Hi, lookup tables ARE faster, but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON. Just replace it with a static if (Method = "Linear") { /*...*/ } else { /*...*/ } Also take care with the type used. With DMD the implicit coercion of float and double can lead to extra conversions. You'll directly see a 15% gain after refactoring the switch. Surely not?!? Surely the compiler can optimize that switch since the value passed is known at compile time? I thought the whole point of not having a static switch (analogous to static if) was that it would go ahead and optimize these cases for us... and it's just a switch, just a jump table.
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 08:25:58 UTC, Basile B. wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? [...] Hi, lookup tables ARE faster, but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON. Just replace it with a static if (Method = "Linear") { /*...*/ } else { /*...*/ } Oh no... I meant "==" obviously. Also take care with the type used. With DMD the implicit coercion of float and double can lead to extra conversions. You'll directly see a 15% gain after refactoring the switch.
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? [...] Hi, lookup tables ARE faster, but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON. Just replace it with a static if (Method = "Linear") { /*...*/ } else { /*...*/ } Also take care with the type used. With DMD the implicit coercion of float and double can lead to extra conversions. You'll directly see a 15% gain after refactoring the switch.
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access. It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays. Exactly, and by the way this is an old story already. I remember, a long time ago when I was studying/writing some games, I decided to test pre-computed arrays (LUTs) for SIN/COS vs functions, and the latter would beat the arrays pretty easily. And by the way, when porting old games (if not changing the game logic too much), you usually get rid of the LUT and use the functions directly. Matheus.
Re: Performance of tables slower than built in?
On 22/05/2019 12:22 PM, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? The code below uses matplotlibd but is not necessary. The timing code is lower. Ideally I'd like to have quadratic interpolation and all that but if the code is going to be slower and less accurate then what is the point? Surely I'm doing something wrong? I wasn't able to test this with LDC because it's giving me linker errors for some reason. I'm also curious as to what D is using internally in the first place. It should be the libc's. With regards to your code, it looks like your computation is wrong. I gave it a go (although refactored) and it just wouldn't output the right set of values. So performance isn't all that important right now.
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access. It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays.
Performance of tables slower than built in?
I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? The code below uses matplotlibd but it is not necessary. The timing code is lower down. Ideally I'd like to have quadratic interpolation and all that, but if the code is going to be slower and less accurate, then what is the point? Surely I'm doing something wrong? I wasn't able to test this with LDC because it's giving me linker errors for some reason. I'm also curious as to what D is using internally in the first place. The generation uses a quarter sine, since the waveform is symmetric. This does slow down the code a little, but I can't imagine the few ops it uses to do this are more costly than whatever is used for computing sin.

import std.math;
import std.range;
import std.random;
import std.algorithm;
import plt = matplotlibd.pyplot;
import std.stdio, core.stdc.stdlib;

__gshared double[] QuarterSinTab;

auto GenQuarterSinTab(int res = 4*100)
{
    res++; // Add extra endpoint to make linear interpolation easier/faster
    alias T = typeof(QuarterSinTab[0]);
    if (QuarterSinTab !is null) free(QuarterSinTab.ptr);
    QuarterSinTab = (cast(T*)malloc(T.sizeof*res))[0..res];
    res--;
    for(int i = 0; i < res; i++)
        QuarterSinTab[i] = sin(PI*(i/cast(double)res));
    QuarterSinTab[$-1] = QuarterSinTab[0];
}

/* Uses the tabularized quarter sine and extends it to the full range */
auto sinTab(string Method = "Linear")(typeof(QuarterSinTab[0]) x)
{
    alias T = typeof(QuarterSinTab[0]);
    auto len = QuarterSinTab.length - 1;
    auto s = -1;
    auto m = x - cast(int)x;
    auto m2 = 1;
    if (x < 0) { m = m + 1; s = 1; }
    if ((abs(x) % 2) < 1) s = -s;
    auto a = m*len;
    auto ai = cast(int)(m*len);
    switch(Method)
    {
        default:
        case "Linear":
            auto f = a - ai;
            return s*((1 - f)*QuarterSinTab[ai] + f*QuarterSinTab[ai+1]);
        case "Constant":
            return s*QuarterSinTab[ai];
    }
}

void main()
{
    GenQuarterSinTab;

    auto x = iota(-10, 10, 0.01).map!(x => x * PI);
    auto y = x.map!("sin(PI*a)");
    auto z = x.map!((x){ return sin(PI*x) - sinTab(x);});
    //plt.plot(x, y, "r-", ["label": "$y=sin(x)$"]);
    plt.plot(x, z, "b-", ["label": "$y$"]);
    plt.xlim(-5, 5);
    plt.ylim(-0.001, 0.001);
    plt.legend();
    plt.show();

    import std.datetime;
    double xxx = 0;
    StopWatch sw;
    sw.start();
    for(double i = 0; i < 1000; i++) xxx += sinTab(i);
    auto t = sw.peek().msecs;
    writeln(t);
    sw.stop();
    writeln(xxx);

    xxx = 0;
    sw.reset();
    sw.start();
    for(double i = 0; i < 1000; i++) xxx += sin(PI*i);
    t = sw.peek().msecs;
    writeln(t);
    sw.stop();
}